# Final Project Logistic Regression Model

# Datascience DIKU 2021

This notebook documents our work on our logistic regression model. The list models_list can be swapped out so that the notebook also trains random forest, gradient boosting, multinomial naive Bayes, AdaBoost and decision tree classifiers.

## User guide
The notebook requires the LoadingAndCleaning.py file, data from FakeNewsCorpus, and some scraped Wikinews articles; the Wikinews data can be turned off. After training, we also get predictions on the LIAR set and on a Kaggle competition test set. With the right data in place, the notebook can be run from start to finish, and a few booleans at the start control the important parameters.

## Cleaning
The supporting Python module does the cleaning. Importantly, it drops articles of type "unknown" and articles longer than 24000 characters. It maps "reliable" and "political" to genuine news and everything else to fake. Cleaning includes lowercasing, contraction expansion, collapsing whitespace and removing odd characters. We also replace emails, dates, URLs and numbers with placeholder tokens such as NUMNUM. We then tokenize and remove stopwords, before recombining the tokens into a processed string for the TF IDF fit and transform. Finally, we append some scraped Wikinews articles, which can be turned off.
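
The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration, not the code in LoadingAndCleaning.py: the regular expressions, the tiny stopword set, the function names, and the URLURL and EMAILEMAIL placeholders are assumptions made for the example (the notebook itself only shows NUMNUM), and contraction expansion and date replacement are left out.

```python
import re

# Hypothetical stand-ins; the real patterns and stopword list live in LoadingAndCleaning.py
STOPWORDS = {"the", "a", "an", "is", "at", "on", "to", "and", "of"}
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
NUM_RE = re.compile(r"\d+(?:[.,]\d+)*")


def clean_text(text):
    # Lowercase first, so the uppercase placeholders stay recognizable
    text = text.lower()
    text = URL_RE.sub(" URLURL ", text)
    text = EMAIL_RE.sub(" EMAILEMAIL ", text)
    text = NUM_RE.sub(" NUMNUM ", text)
    # Drop anything that is not a letter or whitespace
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    # Collapse runs of whitespace
    return re.sub(r"\s+", " ", text).strip()


def process_text(text):
    # Tokenize on whitespace, remove stopwords, and rejoin for TF IDF
    tokens = [t for t in clean_text(text).split() if t not in STOPWORDS]
    return tokens, " ".join(tokens)


tokens, joined = process_text(
    "Mail bob@example.com or visit https://example.com on 12 May 2021!"
)
print(joined)
```

The function returns both the token list and the joined string, mirroring the content and content_joined columns shown later in the notebook.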

## Training
We use the default 75/25 train/test split. Tuned C and max iter values for logistic regression were found with sklearn's GridSearchCV; see the bottom of the notebook. The run shown below uses LogisticRegression() with its default parameters.
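
As a small check of what the default split means: when neither test_size nor train_size is given, sklearn's train_test_split holds out 25% of the rows for testing. The toy arrays below are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows with 2 features each, and alternating labels
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# No test_size given, so the default of 25% test data applies
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print(len(X_train), len(X_test))  # 75 25
```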

In [1]:
import pickle
import time
from multiprocessing import Pool

import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

from LoadingAndCleaning import init_dataframe
import LoadingAndCleaning
import Vizualisations
import LoadingAndCleaningNoMultiProcessing

In [5]:
def load_and_clean_wikinews():
    wiki_df = pd.read_csv('data/wiki_data.csv', index_col=[0])
    wiki_df['url'] = 0
    wiki_df['scraped_at'] = 0
    wiki_df['type'] = 1
    wiki_df['domain'] = 0
    wiki_df['authors'] = ""
    wiki_df['content'] = wiki_df['content'].astype('str')
    wiki_df = LoadingAndCleaning.trim_df_only_content(wiki_df)
    wiki_df = LoadingAndCleaning.clean_df_only_content(wiki_df)
    wiki_df['label'] = 1
    wiki_df = LoadingAndCleaning.process_df_only_content(wiki_df)
    
    return wiki_df

In [6]:
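def adjust_df_labels_remove_fake(df):
    # Balance the labels by downsampling fake (0) rows to the number of real (1) rows.
    # Assumes there are at least as many fake rows as real rows.
    df_real = df[df['label'] == 1]
    df_fake = df[df['label'] == 0]
    min_len = len(df_real)
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df_adjusted = pd.concat([df_real, df_fake.sample(n=min_len)])
    
    return df_adjusted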
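
The function above always shrinks the fake class to the size of the real class. A slightly more general sketch, which downsamples every class to the size of the smallest one, could look like this. The function name balance_by_downsampling and the toy dataframe are hypothetical, and groupby(...).sample requires pandas 1.1 or newer.

```python
import pandas as pd


def balance_by_downsampling(df, label_col="label", random_state=0):
    # Size of the smallest class decides how many rows each class keeps
    smallest = df[label_col].value_counts().min()
    # Draw the same number of rows from every class
    return df.groupby(label_col).sample(n=smallest, random_state=random_state)


# Toy data: 3 real (1) rows and 7 fake (0) rows
toy = pd.DataFrame({"label": [1] * 3 + [0] * 7, "content": list("abcdefghij")})
balanced = balance_by_downsampling(toy)
print(balanced["label"].value_counts())
```

Fixing random_state makes the sampled subset reproducible between runs, which the original sample(n=min_len) call does not guarantee.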
    

# 1.1 If only using content, for Kaggle and LIAR


In [7]:
load_path = 'data/1mio-raw.csv'
percent_load = 0.1
use_multiprocessing = False
is_clean_authors_and_title = False # cleaned and processed like content, ie tokenized
use_wiki = True

save_kaggle_submission = True
kaggle_path = 'data/submissions/base_model_preds.csv'

## Initialize, trim, clean, process and balance label distribution

In [8]:
# Init and trim dataframe
df = init_dataframe(load_path, percent_load)
df['content'] = df['content'].astype('str')
df = LoadingAndCleaning.trim_df_only_content(df)

if not use_wiki:
    df = adjust_df_labels_remove_fake(df)  
df = LoadingAndCleaning.clean_df_only_content(df, use_multiprocessing)
df = LoadingAndCleaning.process_df_only_content(df, use_multiprocessing)

if use_wiki:
    wiki_news = load_and_clean_wikinews()
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df = pd.concat([df, wiki_news], ignore_index=True)
    df = adjust_df_labels_remove_fake(df)  

Initializing dataframe...
Initialized data (100594 entries) in: 5.8 seconds
Cleaning...
Processing...
Processed(66036 entries) in: 85.1 seconds
Cleaning...
Cleaned (3323 entries) in: 1.9 seconds
Processing...
Processed(3323 entries) in: 3.2 seconds


# 1.2 Let's look at a few data properties

In [53]:
df.head(3)

Unnamed: 0,type,content,label,content_joined,content_length
6,reliable,"[screenshot, youtube, mjoflakelandNUMNUM, metr...",1,screenshot youtube mjoflakelandNUMNUM metro bu...,360
14,reliable,"[photo, reutersbrian, snyder, demonstrators, h...",1,photo reutersbrian snyder demonstrators holdin...,303
18,reliable,"[much, fanfare, publicity, january, declared, ...",1,much fanfare publicity january declared human ...,55


In [55]:
if use_wiki:
    display(wiki_news.head(3))

Unnamed: 0,type,content,label,content_joined,content_length
0,1,"[free, trade, zone, antalya, turkey, NUMNUM, w...",1,free trade zone antalya turkey NUMNUM women wo...,536
1,1,"[three, police, officers, NUMNUM, batalhãof, d...",1,three police officers NUMNUM batalhãof da pm m...,307
2,1,"[novak, story, sparked, contoversy, written, j...",1,novak story sparked contoversy written july NU...,315


#### Check label distribution

In [11]:
# Check counts
df['label'].value_counts()

1    29258
0    29258
Name: label, dtype: int64

#### Check length of df 

In [12]:
# Length of working dataframe
len(df)

58516

# 2. Define Term Frequency Inverse Document Frequency (TF IDF) 

In [16]:
def run_tf_idf(train_df, test_df=None, use_title_and_author=False):
    
    if use_title_and_author:
        feature = 'all_info'
    else: 
        feature = 'content_joined'
        
    # get target values
    train_target = train_df['label'].values
    
    # Prepare the tf-idf (term frequency-inverse document frequency) TODO: read up on this for report
    start_time = time.time() 
    count_vectorizer = CountVectorizer(ngram_range=(1, 2))
    tf_idf_transformer = TfidfTransformer(smooth_idf=False)

    # fit and transform train data to count vectorizer
    count_vectorizer.fit(train_df[feature].values)
    count_vect_train = count_vectorizer.transform(train_df[feature].values)
    # fit the counts vector to tfidf transformer
    tf_idf_transformer.fit(count_vect_train)
    tf_idf_train = tf_idf_transformer.transform(count_vect_train)

    # Transform the test data as well
    # NB! We don't use this result: tf_idf_test is not returned
    if test_df is not None:
        count_vect_test = count_vectorizer.transform(test_df[feature].values)
        tf_idf_test = tf_idf_transformer.transform(count_vect_test)

    end_time = time.time()
    print("Time elapsed for TF IDF transform: ,", end_time - start_time)

    X_train, X_test, y_train, y_test = train_test_split(tf_idf_train, train_target, random_state=0)
    
    return X_train, X_test, y_train, y_test, tf_idf_transformer, count_vectorizer
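
Note that run_tf_idf fits the vectorizer on the whole dataframe and only splits afterwards, so the vocabulary and IDF weights have seen the held-out rows. One possible alternative, sketched here under assumptions, is to split the raw text first and wrap the vectorizer and classifier in an sklearn Pipeline. TfidfVectorizer is used as the documented equivalent of CountVectorizer followed by TfidfTransformer; the toy texts and labels are invented for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Invented toy corpus: 1 = real, 0 = fake
texts = [
    "council approves new budget", "minister opens trade talks",
    "court rules on appeal today", "police report published monday",
    "aliens secretly run banks", "miracle cure doctors hate",
    "shocking truth they hide", "secret plot finally exposed",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, so the vectorizer never sees the test rows
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, random_state=0, stratify=labels
)

pipeline = Pipeline([
    # Same settings as the notebook: unigrams and bigrams, no IDF smoothing
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), smooth_idf=False)),
    ("clf", LogisticRegression()),
])
pipeline.fit(train_texts, y_train)
predictions = pipeline.predict(test_texts)
print(predictions)
```

A fitted pipeline can also be applied directly to new text such as the LIAR or Kaggle sets, since it carries its own vectorizer.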

# 3. Define Classifiers for Training and Validating

In [17]:
# Setting up model metric dataframe and model fitting
# TODO: Training should happen in .py file, preferably pooled
#######################
def get_perf_metrics(model, i, X_train, y_train, X_test, y_test):
    # model name
    model_name = type(model).__name__
    # time keeping
    start_time = time.time()
    print("Training {} model...".format(model_name))
    # Fitting of model
    
    # list to retain the models to use later for test set predictions
    models_trained_list = []
    
    model.fit(X_train, y_train)
    print("Completed {} model training.".format(model_name))
    elapsed_time = time.time() - start_time
    # Time Elapsed
    print("Time elapsed: {:.2f} s.".format(elapsed_time))
    # Predictions
    y_pred = model.predict(X_test)
    # Add to ith row of dataframe - metrics
    df_perf_metrics.loc[i] = [
        model_name,
        model.score(X_train, y_train),
        model.score(X_test, y_test),
        precision_score(y_test, y_pred),
        recall_score(y_test, y_pred),
        f1_score(y_test, y_pred), "{:.2f}".format(elapsed_time)
    ]
    # keep a track of trained models
    models_trained_list.append(model)

    print("Completed {} model's performance assessment.".format(model_name))
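
get_perf_metrics writes into the global df_perf_metrics, and its models_trained_list is local, so it is discarded when the function returns. A possible variant, shown here only as a sketch, returns one row of metrics per model and lets the caller build the dataframe. The name evaluate_model and the synthetic data from make_classification are assumptions for the example.

```python
import time

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split


def evaluate_model(model, X_train, y_train, X_test, y_test):
    # Fit the model and time the training
    start_time = time.time()
    model.fit(X_train, y_train)
    elapsed_time = time.time() - start_time
    y_pred = model.predict(X_test)
    # Return the metrics as a plain dict instead of mutating a global dataframe
    return {
        "Model": type(model).__name__,
        "Accuracy_Training_Set": model.score(X_train, y_train),
        "Accuracy_Test_Set": model.score(X_test, y_test),
        "Precision": precision_score(y_test, y_pred),
        "Recall": recall_score(y_test, y_pred),
        "f1_score": f1_score(y_test, y_pred),
        "Training Time (secs)": round(elapsed_time, 2),
    }


# Synthetic stand-in for the TF IDF features
X, y = make_classification(n_samples=200, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = [LogisticRegression(max_iter=1000)]
metrics_df = pd.DataFrame([evaluate_model(m, X_tr, y_tr, X_te, y_te) for m in models])
print(metrics_df)
```

The fitted models stay available in the models list, so they can be reused later for test set predictions.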

# 4. Run the TF IDF transforms, use default 3/4 split

In [18]:
# Do the tf idf transform
# Note we save the transformer and vectorizer, because we need this to transform test data in similar way
X_train, X_test, y_train, y_test, tf_idf_transformer, count_vectorizer = run_tf_idf(df,
                                                                                    test_df=None,
                                                                                    use_title_and_author=False)

Time elapsed for TF IDF transform: , 99.30678725242615


# 5. Training models

In [20]:
df_perf_metrics = pd.DataFrame(columns=[
                'Model', 'Accuracy_Training_Set', 
                'Accuracy_Test_Set', 'Precision',
                'Recall', 'f1_score', 'Training Time (secs)'
                ])


# Set up model list and run training
#C=275, max_iter=185
#models_list = [LogisticRegression(C=200, max_iter=200)]
models_list = [LogisticRegression()]

#models_list = [LogisticRegression(), 
#               MultinomialNB(),
#               RandomForestClassifier(),
#               DecisionTreeClassifier(),
#               GradientBoostingClassifier(),
#               AdaBoostClassifier()]

train_models = True
if train_models: 
    for n, model in enumerate(models_list):
        get_perf_metrics(model, n, X_train, y_train, X_test, y_test)

Training LogisticRegression model...
Completed LogisticRegression model training.
Time elapsed: 101.88 s.
Completed LogisticRegression model's performance assessment.


# 5.1 Show results

In [21]:
# Look at metrics for all models
df_perf_metrics

Unnamed: 0,Model,Accuracy_Training_Set,Accuracy_Test_Set,Precision,Recall,f1_score,Training Time (secs)
0,LogisticRegression,0.944972,0.839087,0.812244,0.877178,0.843463,101.88


# 6. Consider saving model

In [22]:
def save_trained_model(model, pkl_filename):
    # Save the given model to a pickle file (pickle is imported at the top of the notebook)
    # The model: "models/log_reg_optimized_100percent_data.pkl" has the following params:
    # 100% data (394k rows after cleaning), C: 150, max iter 200, added title and author
    # tested on political/reliable split, balanced by dropping nans
    #pkl_filename = "models/base_model_not_dropped_nans.pkl"
    with open(pkl_filename, 'wb') as file:
        pickle.dump(model, file)

In [23]:
def load_trained_model(pkl_filename):
    with open(pkl_filename, 'rb') as file:
        loaded_model = pickle.load(file)
    return loaded_model
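
Because new data must be transformed with the same fitted vectorizer and transformer, it may be convenient to pickle them together with the model. The sketch below shows such a round trip with placeholder strings standing in for the fitted objects; save_bundle, load_bundle and the dictionary keys are hypothetical names.

```python
import os
import pickle
import tempfile


def save_bundle(path, bundle):
    # Write the whole bundle to a single pickle file
    with open(path, "wb") as file:
        pickle.dump(bundle, file)


def load_bundle(path):
    # Only unpickle files you created yourself; pickle can execute code on load
    with open(path, "rb") as file:
        return pickle.load(file)


# Placeholders; in the notebook these would be the fitted sklearn objects
bundle = {
    "model": "fitted-model-placeholder",
    "count_vectorizer": "fitted-vectorizer-placeholder",
    "tf_idf_transformer": "fitted-transformer-placeholder",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "bundle.pkl")
    save_bundle(path, bundle)
    restored = load_bundle(path)

print(restored == bundle)  # True
```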

# 7. Prepare and test on LIAR
We map "true" and "mostly-true" to real (1) and every other LIAR label to fake (0).
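
The mapping can be illustrated on the six LIAR labels. This sketch uses Series.isin rather than the np.where expression in prepare_liar below; both give the same 0/1 column.

```python
import pandas as pd

# Labels treated as real; everything else counts as fake
REAL_LABELS = {"true", "mostly-true"}

labels = pd.Series(
    ["true", "mostly-true", "half-true", "barely-true", "false", "pants-fire"]
)

# True/False from isin becomes 1/0
custom_label = labels.isin(REAL_LABELS).astype(int)
print(custom_label.tolist())  # [1, 1, 0, 0, 0, 0]
```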

In [24]:
def prepare_liar():
    liar_path = "data/liar_dataset/test.tsv"
    liar_df = pd.read_csv(liar_path, sep='\t', names=["ID", "label", "statement", "subject", "speaker", "job title", "state ", "party ","total credit", "barely true counts","false counts", "half true counts", "mostly true counts", "pants on fire counts","context "])
    liar_df['custom_label'] = np.where(((liar_df['label'] == 'true') | (liar_df['label'] == 'mostly-true')), 1, 0)
    liar_df = liar_df.dropna(subset=['label', 'statement'])
    liar_df['statement'] = liar_df['statement'].str.strip()
    liar_df = liar_df.drop_duplicates(subset="statement") 
    liar_df["statement"] = pd.concat(map(LoadingAndCleaning.clean_content, np.array_split(liar_df["statement"], 32)))
    liar_df["statement"] = pd.concat(map(LoadingAndCleaning.clean_content, np.array_split(liar_df["statement"], 32)))
    liar_df["statement"] = pd.concat(map(LoadingAndCleaning.tokenize, np.array_split(liar_df["statement"], 32)))
    liar_df["statement"] = pd.concat(map(LoadingAndCleaning.remove_stopwords, np.array_split(liar_df["statement"], 32)))
    liar_df["statement_joined"] = pd.concat(map(LoadingAndCleaning.recombine_content, np.array_split(liar_df["statement"], 32)))
    liar_df["statement_len"] = pd.concat(map(LoadingAndCleaning.add_content_length, np.array_split(liar_df["statement"], 32)))
    return liar_df


In [25]:
def run_tf_idf_for_test_prep(df, tf_idf_transformer, count_vectorizer, feature, target = False):

    # We only have target for LIAR, not Kaggle
    if target:
        # Read the labels from the dataframe passed in, not from the global liar_df
        test_target = df['custom_label'].values
    
    # Prepare the tf-idf (term frequency-inverse document frequency) TODO: read up on this for report
    start_time = time.time()
    
    count_vect_test = count_vectorizer.transform(df[feature].values)
    tf_idf_test = tf_idf_transformer.transform(count_vect_test)

    end_time = time.time()
    print("Time elapsed for TF IDF transform: ,", end_time - start_time)

    if target:
        return tf_idf_test, test_target
    else:
        return tf_idf_test

In [26]:
liar_df = prepare_liar()

In [27]:
liar_df['statement_len'].mean()

11.26440410418311

In [28]:
liar_df.head(2)

Unnamed: 0,ID,label,statement,subject,speaker,job title,state,party,total credit,barely true counts,false counts,half true counts,mostly true counts,pants on fire counts,context,custom_label,statement_joined,statement_len
0,11972.json,True,"[building, wall, yousmexico, border, take, lit...",immigration,rick-perry,Governor,Texas,republican,30,30,42,23,18,Radio interview,,1,building wall yousmexico border take literally...,7
1,11685.json,False,"[wisconsin, pace, double, number, layoffs, year]",jobs,katrina-shankland,State representative,Wisconsin,democrat,2,1,0,0,0,a news conference,,0,wisconsin pace double number layoffs year,6


### 7.2 Run the TF IDF transform on LIAR set

In [29]:
LIAR_tf_idf_test, LIAR_train_target = run_tf_idf_for_test_prep(liar_df,
                                                               tf_idf_transformer,
                                                               count_vectorizer,
                                                               feature = "statement_joined",
                                                               target=True)

Time elapsed for TF IDF transform: , 0.14763641357421875


In [30]:
LIAR_predictions = model.predict(LIAR_tf_idf_test)

In [31]:
# Actual label distribution in our (balanced) training data
print(df['label'].value_counts() * 100/len(df))

# Now print our predicted distribution
print(np.bincount(LIAR_predictions)* 100 / len(LIAR_predictions))

1    50.0
0    50.0
Name: label, dtype: float64
[16.96921863 83.03078137]


### 7.3 Look at LIAR preds

In [32]:
print('Accuracy: ', accuracy_score(LIAR_train_target, LIAR_predictions))
print('Precision: ', precision_score(LIAR_train_target, LIAR_predictions))
print('Recall: ', recall_score(LIAR_train_target, LIAR_predictions))
print('f1-score: ', f1_score(LIAR_train_target, LIAR_predictions))
# Accuracy is below 50% on LIAR, with most statements predicted as real

Accuracy:  0.3898973954222573
Precision:  0.34600760456273766
Recall:  0.8106904231625836
f1-score:  0.4850099933377749
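
As a sanity check on these numbers, the F1 score is the harmonic mean of precision and recall, so it can be recomputed from the two values printed above. The helper name below is made up for the example.

```python
# Values copied from the LIAR evaluation output above
precision = 0.34600760456273766
recall = 0.8106904231625836


def f1_from_precision_recall(p, r):
    # Harmonic mean of precision and recall
    return 2 * p * r / (p + r)


f1 = f1_from_precision_recall(precision, recall)
print(round(f1, 4))  # 0.485
```

The high recall combined with low precision fits the predicted distribution shown earlier: the model labels about 83% of the LIAR statements as real, so it catches most real statements but is wrong about many of them.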


# 8. Prepare Kaggle set for prediction

In [33]:
# Adjustable cleaner
def custom_clean(df, feature):
    feature_joined = feature + "_joined"
    df[feature] = pd.concat(map(LoadingAndCleaning.clean_content, np.array_split(df[feature], 32)))
    df[feature] = pd.concat(map(LoadingAndCleaning.tokenize, np.array_split(df[feature], 32)))
    df[feature] = pd.concat(map(LoadingAndCleaning.remove_stopwords, np.array_split(df[feature], 32)))
    df[feature_joined] = pd.concat(map(LoadingAndCleaning.recombine_content, np.array_split(df[feature], 32)))
    return df
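
custom_clean relies on a split, map, concat pattern: the column is cut into chunks, each chunk is processed, and pd.concat stitches the results back together. Assigning the result to the dataframe works because every chunk keeps its original index. The sketch below shows the same idea with explicit iloc slicing instead of np.array_split; apply_in_chunks and lowercase are hypothetical helpers.

```python
import pandas as pd


def apply_in_chunks(series, func, n_chunks=4):
    # Ceiling division, so that at most n_chunks chunks are produced
    chunk_size = max(1, -(-len(series) // n_chunks))
    chunks = [
        series.iloc[i:i + chunk_size] for i in range(0, len(series), chunk_size)
    ]
    # Each chunk keeps its original index, so concat restores the alignment
    return pd.concat(map(func, chunks))


def lowercase(chunk):
    # Hypothetical stand-in for a cleaner such as LoadingAndCleaning.clean_content
    return chunk.str.lower()


s = pd.Series(["Hello World", "FAKE News", "Real NEWS"], index=[10, 20, 30])
result = apply_in_chunks(s, lowercase, n_chunks=2)
print(result)
```

With the builtin map the chunks are processed one after another; the chunking only pays off when map is replaced by a parallel map such as the one provided by multiprocessing.Pool.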

In [34]:
def prepare_kaggle():
    kaggle_path = "data/kaggle_test_data/test_set.json"
    kaggle_df = pd.read_json(kaggle_path)
    kaggle_df = custom_clean(kaggle_df, "article")
    return kaggle_df

In [35]:
kaggle_df = prepare_kaggle()

In [36]:
kaggle_df.head()

Unnamed: 0,id,article,article_joined
0,8476,"[daniel, greenfield, shillman, journalism, fel...",daniel greenfield shillman journalism fellow f...
1,10294,"[google, pinterest, digg, linkedin, reddit, st...",google pinterest digg linkedin reddit stumbleu...
2,3608,"[yous, secretary, state, john, f, kerry, said,...",yous secretary state john f kerry said monday ...
3,10142,"[kaydee, king, kaydeeking, november, NUMNUM, N...",kaydee king kaydeeking november NUMNUM NUMNUM ...
4,875,"[primary, day, new, york, frontrunners, hillar...",primary day new york frontrunners hillary clin...


### 8.1 TF IDF Transform Kaggle

In [37]:
kaggle_tf_idf_test = run_tf_idf_for_test_prep(kaggle_df, tf_idf_transformer, count_vectorizer, feature='article_joined')

Time elapsed for TF IDF transform: , 5.432998180389404


### 8.2 Get predictions and print scores

In [38]:
kaggle_predictions = model.predict(kaggle_tf_idf_test)

In [39]:
# Actual label distribution in our (balanced) training data
print(df['label'].value_counts() * 100/len(df))

# Now print our predicted distribution

print("Our predicted distribution of Kaggle's real/fake: ", np.bincount(kaggle_predictions)* 100 / len(kaggle_predictions))

1    50.0
0    50.0
Name: label, dtype: float64
Our predicted distribution of Kaggle's real/fake:  [30.86029992 69.13970008]
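
One caveat with np.bincount: if the model predicts only class 0, the result has a single entry, and the printed distribution no longer has one value per class. Passing minlength keeps both slots. The helper name predicted_distribution is an assumption for this sketch.

```python
import numpy as np


def predicted_distribution(predictions, n_classes=2):
    # minlength guarantees one count per class, even for classes never predicted
    counts = np.bincount(predictions, minlength=n_classes)
    return counts * 100 / len(predictions)


all_real = predicted_distribution(np.array([1, 1, 1, 1]))
mixed = predicted_distribution(np.array([0, 1, 1, 1]))
print(all_real)  # fake %, real %
print(mixed)
```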


In [40]:
def get_kaggle_submission(kaggle_df, kaggle_predictions):
    # Work on a copy so that kaggle_df itself is not modified
    kaggle_submission = kaggle_df.copy()
    kaggle_submission['label'] = kaggle_predictions.tolist()
    kaggle_submission = kaggle_submission.drop(['article', 'article_joined'], axis=1)
    kaggle_submission['label'] = np.where(((kaggle_submission['label'] == 1)), "REAL", "FAKE")
    return kaggle_submission

In [41]:
kaggle_submission = get_kaggle_submission(kaggle_df, kaggle_predictions)

In [42]:
kaggle_submission.head()

Unnamed: 0,id,label
0,8476,REAL
1,10294,REAL
2,3608,REAL
3,10142,REAL
4,875,REAL


### 8.3 Save Kaggle predictions into submission folder

In [43]:
if save_kaggle_submission:
    #kaggle_path = 'data/submissions/base_model_preds.csv'
    kaggle_submission.to_csv(kaggle_path, index=False)

# 9. Tuning logistic regression C and max iter (found optimum: C = 275, max iter = 185)

In [44]:

#max_iter = [200, 250, 300]
#C = [75, 100, 125, 150]
# Optimal C: 150, max iter 200

# So, we found the lowest max iter and the highest C to be optimal, and can retune with lower/higher values
# Problem: we get this max iter error when using lower max iter, but we try retuning anyway

In [45]:
#max_iter = [175, 185, 200]
#C = [150, 175, 200]

#Retuning params, we got optimal max iter: 185, and C: 200
#So, we try using max iter: 175, and then bigger C values

In [46]:
#max_iter = [185]
#C = [200, 250, 275]

# With these params, we got an optimal of 185 max iter, and 275 C

In [47]:
#max_iter = [200]
#C = [275]

#param_grid = dict(max_iter=max_iter, C=C)

In [48]:
def gridsearch_C_and_maxiter(C, max_iter):
    # C and max_iter are lists of candidate values, e.g. max_iter = [200], C = [275]
    model = LogisticRegression()
    param_grid = dict(max_iter=max_iter, C=C)
    grid = GridSearchCV(estimator=model,
                        param_grid=param_grid,
                        cv=5,
                        scoring=['f1'],
                        refit='f1',
                        verbose=2)
    
    grid_result = grid.fit(X_train, y_train)
    
    return grid_result
    
    #best_model = grid_result.best_estimator_
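
For readers who want to try the search without the full dataset, here is a self-contained sketch of the same idea on synthetic data. It passes the data in as arguments instead of reading the global X_train and y_train; the candidate values, cv=3 and the make_classification data are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV


def gridsearch_c_and_max_iter(X, y, C, max_iter, cv=3):
    # Try every combination of C and max_iter, scored by cross-validated F1
    grid = GridSearchCV(
        estimator=LogisticRegression(),
        param_grid={"C": C, "max_iter": max_iter},
        cv=cv,
        scoring="f1",
    )
    return grid.fit(X, y)


# Synthetic stand-in for the TF IDF training data
X, y = make_classification(n_samples=120, random_state=0)
result = gridsearch_c_and_max_iter(X, y, C=[0.1, 1.0, 10.0], max_iter=[200])

print(result.best_params_)
print(result.best_score_)
```

When max_iter is too low for the solver to converge, sklearn emits a ConvergenceWarning, which may be the max iter problem noted in the tuning comments above.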

In [49]:
#max_iter = [200]
#C = [275]
# Run gridsearch:
#grid_result = gridsearch_C_and_maxiter(C, max_iter)

## 9.1 Print performance of best C and max iter values

In [50]:
# Best C and max iter vals
#print(grid_result.best_params_)

In [51]:
def print_performance(grid_result):
    model = grid_result.best_estimator_

    y_pred = model.predict(X_test)
    print('Accuracy: ', accuracy_score(y_test, y_pred))
    print('Precision: ', precision_score(y_test, y_pred))
    print('Recall: ', recall_score(y_test, y_pred))
    print('f1-score: ', f1_score(y_test, y_pred))

In [52]:
#print_performance(grid_result)