The next step is to evaluate different parameters for basic models on our data, as opposed to more complex models such as neural networks. I imported the necessary libraries below and, using the data we had collected at that point, began looking for good parameters.

In [2]:
import json
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
import string
from nltk.corpus import stopwords
from nltk import word_tokenize
import scipy
import zipfile

In [6]:
# df = pd.read_json("../../data/News_Category_Dataset_v2.json", lines = True)


zf = zipfile.ZipFile('../../data/dffulltext.csv.zip') 
df = pd.read_csv(zf.open('dffulltext.csv'))

In [3]:
df.drop(["date", "Unnamed: 0"], axis = 1, inplace = True)

In [4]:
df.head()

Unnamed: 0,category,headline,authors,link,short_description
0,CRIME,There Were 2 Mass Shootings In Texas Last Week...,Melissa Jeltsen,https://www.huffingtonpost.com/entry/texas-ama...,She left her husband. He killed their children...
1,ENTERTAINMENT,Will Smith Joins Diplo And Nicky Jam For The 2...,Andy McDonald,https://www.huffingtonpost.com/entry/will-smit...,Of course it has a song.
2,ENTERTAINMENT,Hugh Grant Marries For The First Time At Age 57,Ron Dicker,https://www.huffingtonpost.com/entry/hugh-gran...,The actor and his longtime girlfriend Anna Ebe...
3,ENTERTAINMENT,Jim Carrey Blasts 'Castrato' Adam Schiff And D...,Ron Dicker,https://www.huffingtonpost.com/entry/jim-carre...,The actor gives Dems an ass-kicking for not fi...
4,ENTERTAINMENT,Julianna Margulies Uses Donald Trump Poop Bags...,Ron Dicker,https://www.huffingtonpost.com/entry/julianna-...,"The ""Dietland"" actress said using the bags is ..."


In [5]:
df["category"].value_counts()

POLITICS          32739
WELLNESS          17827
ENTERTAINMENT     16058
TRAVEL             9887
STYLE & BEAUTY     9649
PARENTING          8677
HEALTHY LIVING     6694
QUEER VOICES       6314
FOOD & DRINK       6226
BUSINESS           5937
COMEDY             5175
SPORTS             4884
BLACK VOICES       4528
HOME & LIVING      4195
PARENTS            3955
THE WORLDPOST      3664
WEDDINGS           3651
WOMEN              3490
IMPACT             3459
DIVORCE            3426
CRIME              3405
MEDIA              2815
WEIRD NEWS         2670
GREEN              2622
WORLDPOST          2579
RELIGION           2556
STYLE              2254
SCIENCE            2178
WORLD NEWS         2177
TASTE              2096
TECH               2082
MONEY              1707
ARTS               1509
FIFTY              1401
GOOD NEWS          1398
ARTS & CULTURE     1339
ENVIRONMENT        1323
COLLEGE            1144
LATINO VOICES      1129
CULTURE & ARTS     1030
EDUCATION          1004
Name: category, 

In [6]:
df_2 = pd.read_csv("../../data/dffulltext.csv").drop("Unnamed: 0", axis = 1)

In [7]:
df_2.dropna(subset = ["full_text"], inplace = True)

In [8]:
df_2

Unnamed: 0,category,headline,authors,link,short_description,date,full_text
0,CRIME,There Were 2 Mass Shootings In Texas Last Week...,Melissa Jeltsen,https://www.huffingtonpost.com/entry/texas-ama...,She left her husband. He killed their children...,2018-05-26,Mei-Chun Jau for HuffPost Amanda Painter is th...
1,ENTERTAINMENT,Will Smith Joins Diplo And Nicky Jam For The 2...,Andy McDonald,https://www.huffingtonpost.com/entry/will-smit...,Of course it has a song.,2018-05-26,The 2018 FIFA World Cup starts June 14 in Russ...
2,ENTERTAINMENT,Hugh Grant Marries For The First Time At Age 57,Ron Dicker,https://www.huffingtonpost.com/entry/hugh-gran...,The actor and his longtime girlfriend Anna Ebe...,2018-05-26,Love actually turned to matrimony for Hugh Gra...
3,ENTERTAINMENT,Jim Carrey Blasts 'Castrato' Adam Schiff And D...,Ron Dicker,https://www.huffingtonpost.com/entry/jim-carre...,The actor gives Dems an ass-kicking for not fi...,2018-05-26,Rep. Adam Schiff (D-Calif.) and fellow Democra...
4,ENTERTAINMENT,Julianna Margulies Uses Donald Trump Poop Bags...,Ron Dicker,https://www.huffingtonpost.com/entry/julianna-...,"The ""Dietland"" actress said using the bags is ...",2018-05-26,The “Dietland” star told host Jimmy Fallon tha...
...,...,...,...,...,...,...,...
40378,ARTS & CULTURE,Book Publishers Are Scrambling To Release Trum...,Maddie Crum,https://www.huffingtonpost.com/entry/trump-sur...,Several forthcoming new books center on coping...,2016-12-09,"The day after Donald Trump’s election win, man..."
40379,BLACK VOICES,Second NFL Player Targeted By Stomach-Turning ...,Ryan Grenoble,https://www.huffingtonpost.com/entry/brandon-m...,Broncos linebacker Brandon Marshall said he wa...,2016-12-09,Justin Edmonds via Getty Images Brandon Marsha...
40380,POLITICS,Senate Democrats Give Up On Coal Miner Health ...,Laura Barrón-López,https://www.huffingtonpost.com/entry/governmen...,Sen. Joe Manchin said he hopes to enlist Presi...,2016-12-09,Coal-state Senate Democrats on Friday backed o...
40381,POLITICS,Courtesy Over Death Penalty Cases May Be Dead ...,Cristian Farias,https://www.huffingtonpost.com/entry/supreme-c...,The justices agreed to stay one man's executio...,2016-12-09,"Days before the presidential election, Chief J..."


The first thing I had to do with the data was clean the full text we had gathered. This involved lowercasing all of the text and removing punctuation. In natural language processing we also want to identify the words a computer does not need to look at when categorizing text, known as stopwords. I removed those here, and I also wrote some code to pull the website links out of the article text, since they are unique nonsense tokens that a model cannot categorize well.

In [9]:
# Lowercase every word and collapse runs of whitespace into single spaces.
df_2['lower_text'] = df_2['full_text'].apply(lambda text: " ".join(word.lower() for word in text.split()))

In [10]:
df_2["lower_text"].head()

0    mei-chun jau for huffpost amanda painter is th...
1    the 2018 fifa world cup starts june 14 in russ...
2    love actually turned to matrimony for hugh gra...
3    rep. adam schiff (d-calif.) and fellow democra...
4    the “dietland” star told host jimmy fallon tha...
Name: lower_text, dtype: object

In [11]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [12]:
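# Strip everything that is not a word character or whitespace.
# regex=True is needed on pandas 2.0 and later, where str.replace treats the pattern as a literal string by default.
df_2["lower_text"] = df_2["lower_text"].str.replace(r'[^\w\s]', "", regex=True)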

In [13]:
stop = stopwords.words("english")

In [14]:
# Drop NLTK's English stopwords from each article.
df_2['lower_text'] = df_2['lower_text'].apply(lambda text: " ".join(word for word in text.split() if word not in stop))

In [15]:
df_2["lower_text"].head()

0    meichun jau huffpost amanda painter sole survi...
1    2018 fifa world cup starts june 14 russia offi...
2    love actually turned matrimony hugh grant 57ye...
3    rep adam schiff dcalif fellow democrats better...
4    dietland star told host jimmy fallon uses trum...
Name: lower_text, dtype: object
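
The cleaning steps above (lowercase, strip punctuation, drop stopwords) can be collected into a single function. The sketch below is a minimal version that uses only the standard library. `clean_text` and the tiny `STOP` set are hypothetical stand-ins (the notebook uses NLTK's English stopword list), and the sample sentence is made up, so this illustrates the order of operations rather than reproducing the notebook's exact output.

```python
import re

# Hypothetical, tiny stop list for illustration only; the notebook uses
# nltk.corpus.stopwords.words("english") instead.
STOP = {"the", "is", "a", "for", "in", "to", "and", "of"}

def clean_text(text, stop=STOP):
    """Lowercase, strip punctuation, then drop stopwords."""
    text = text.lower()
    # Remove every character that is neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(word for word in text.split() if word not in stop)

print(clean_text("Mei-Chun Jau for HuffPost: Amanda Painter is the sole survivor."))
# prints: meichun jau huffpost amanda painter sole survivor
```

Note that removing punctuation outright glues hyphenated words together ("mei-chun" becomes "meichun", "57-year-old" becomes "57yearold"), which is also visible in the notebook output above.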

In [16]:
df_2["category"] = df_2["category"].str.lower()

In [18]:
def tfidf(X, y,  stopwords_list): 
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    vectorizer = TfidfVectorizer(stop_words=stopwords_list)
    tf_idf_train = vectorizer.fit_transform(X_train)
    tf_idf_test = vectorizer.transform(X_test)
    return tf_idf_train, tf_idf_test, y_train, y_test, vectorizer
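
The helper above fits the vectorizer on the training split only and then reuses it on the test split, so no test vocabulary leaks into training. To make the weights themselves less of a black box, here is a minimal pure-Python sketch of TF-IDF. It follows the smoothed formula documented for `TfidfVectorizer`'s default settings, idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization, but it tokenizes by simple whitespace splitting, so it is an illustration and not a drop-in replacement. `tfidf_weights` and the three sample documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {t: c * idf[t] for t, c in counts.items()}
        # L2-normalize each document; fall back to 1.0 for an empty document.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

docs = ["senate democrats vote", "senate vote delayed", "fifa world cup"]
weights = tfidf_weights(docs)
# "democrats" appears in one document and "senate" in two, so within the
# first document "democrats" receives the larger weight.
print(weights[0])
```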

In [19]:
# stopwords_list = stopwords.words('english') 
# idf_train, idf_test, y_tr, y_t, vectorizer = tfidf(X, y, stopwords_list)

In [20]:
df_3 = pd.DataFrame()

In [21]:
# df_3["lower_text"] = df_2["lower_text"].apply(lambda x: x.replace("  ", " ").replace("   ", " ").replace("    ", " ").replace("     ", " "))

In [22]:
df_3["lower_text"] = df_2["lower_text"].apply(lambda x: x.split(" "))

In [23]:
df_3["category"] = df_2["category"]

I built a separate dataframe containing a single row per word found in each text, in order to determine which words were used so infrequently, or so frequently, that the model would not receive any value from noticing them.

In [24]:
df_3

Unnamed: 0,lower_text,category
0,"[meichun, jau, huffpost, amanda, painter, sole...",crime
1,"[2018, fifa, world, cup, starts, june, 14, rus...",entertainment
2,"[love, actually, turned, matrimony, hugh, gran...",entertainment
3,"[rep, adam, schiff, dcalif, fellow, democrats,...",entertainment
4,"[dietland, star, told, host, jimmy, fallon, us...",entertainment
...,...,...
40378,"[day, donald, trumps, election, win, many, ame...",arts & culture
40379,"[justin, edmonds, via, getty, images, brandon,...",black voices
40380,"[coalstate, senate, democrats, friday, backed,...",politics
40381,"[days, presidential, election, chief, justice,...",politics


In [25]:
df_3 = df_3.explode("lower_text")

In [26]:
df_3.head()

Unnamed: 0,lower_text,category
0,meichun,crime
0,jau,crime
0,huffpost,crime
0,amanda,crime
0,painter,crime


In the cell below we can see that the articles contain "words" that are nothing but a picture link from Twitter. Each of these occurs only once, but because they all share a common prefix they can be found as a group and removed from our articles entirely.

In [27]:
df_3.lower_text.value_counts()

said                       106009
trump                       76536
people                      66456
would                       59388
us                          56237
                            ...  
pictwittercomxiygv4syuq         1
nonalternative                  1
sklars                          1
roomtoroom                      1
gusisnotforus                   1
Name: lower_text, Length: 246874, dtype: int64

In [28]:
pictwitter = df_3["lower_text"].loc[df_3["lower_text"].str.startswith("pictwitter")]

In [29]:
df_3 = df_3.loc[df_3["lower_text"].str.startswith("pictwitter") == False]

In [30]:
df_3["lower_text"].loc[df_3["lower_text"].str.contains("http")]

6        hrefhttpenrocketnews24com20170307mcdonaldsjapa...
19       httpswwwscotsmancomnewspoliticstrumpturnberryp...
19       httpswwwscotsmancomnewspoliticstrumpturnberryp...
19       httpswwwscotsmancomnewspoliticstrumpturnberryp...
19       httpswwwscotsmancomnewspoliticstrumpturnberryp...
                               ...                        
40251                                   httpstcoc55yokslif
40280    hrefhttpswwwhuffpostcomentryilhanomarelectedto...
40280    hrefhttpwwwstartribunecomilhanomarwillbenation...
40280    hrefhttpswwwhuffpostcomentrypccccandidates2016...
40291                      httpsfundrazrcomcampaigns11b5z8
Name: lower_text, Length: 6160, dtype: object

In [31]:
http = df_3["lower_text"].loc[df_3["lower_text"].str.contains("http")]

In [32]:
df_3 = df_3.loc[df_3["lower_text"].str.contains("http") == False]

In [33]:
df_3["lower_text"].value_counts().tail(30)

42west             1
eckes              1
steamgrill         1
breathoflife       1
compradorist       1
layhani            1
boles              1
decs               1
candrashekar       1
baams              1
sourcewater        1
inman              1
grothmans          1
laberge            1
josephsteinberg    1
latuffcartoons     1
stagehands         1
raouf              1
arzobispo          1
multioctave        1
suna               1
nrakept            1
armors             1
wfmztv             1
swolfley79         1
russianarmenian    1
fromit             1
pfas               1
penneast           1
gusisnotforus      1
Name: lower_text, dtype: int64

In [34]:
df_3["lower_text"].loc[df_3["lower_text"].str.startswith("www") == True]

304                                      wwwregulationsgov
628                                        wwwbretbaiercom
2357                                    wwwgodhatesfagscom
3460                                        wwwjoesoukicom
4022                                     wwwregulationsgov
                               ...                        
40231                         wwwmonumentsmenfoundationorg
40234                                         wwwlascauxfr
40243    wwwbillboardcomarticleseventswomeninmusic76168...
40303                             wwwlorrainedevonwilkecom
40305                                  wwwjackiekcoopercom
Name: lower_text, Length: 950, dtype: object

In [35]:
www = df_3["lower_text"].loc[df_3["lower_text"].str.startswith("www") == True]

In [36]:
df_3 = df_3.loc[(df_3["lower_text"].str.startswith("www") != True)]
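
Link tokens such as `httpswwwscotsmancom...` only take this glued-together form because punctuation was stripped before the links were removed. One alternative, which is not what the notebook does, is to delete links from the raw text first. The pattern below is a deliberately simple regex and not an exhaustive URL matcher, and `strip_links` is a hypothetical helper name.

```python
import re

# Matches http(s) links, bare www. hosts, and pic.twitter.com image links,
# each running up to the next whitespace character.
URL_PATTERN = re.compile(r"(?:https?://|www\.|pic\.twitter\.com/)\S+")

def strip_links(text):
    """Remove link-like substrings, then collapse leftover whitespace."""
    return " ".join(URL_PATTERN.sub(" ", text).split())

sample = "Read more at https://t.co/c55yokslif or www.regulations.gov today pic.twitter.com/xiygv4syuq"
print(strip_links(sample))
# prints: Read more at or today
```

Run before the punctuation step, this would leave no link fragments to filter out of the word list afterwards, at the cost of also discarding any punctuation attached directly to a link.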

In [37]:
df_2['target'] = df_2['category'].replace(['the worldpost', 'worldpost'],'world news').replace(['black voices', 'queer voices', 'latino voices', 'women'],'diverse voices')
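
The two chained `replace` calls fold near-duplicate labels together. The sketch below expresses the same mapping as a plain dictionary and applies it to the 41 category names from the `value_counts()` output earlier in the notebook. It assumes that all 41 also occur in `df_2`, which the notebook does not show directly. `MERGE` and `to_target` are hypothetical names.

```python
# The 41 category names from the value_counts() output, lowercased.
CATEGORIES = [
    "politics", "wellness", "entertainment", "travel", "style & beauty",
    "parenting", "healthy living", "queer voices", "food & drink", "business",
    "comedy", "sports", "black voices", "home & living", "parents",
    "the worldpost", "weddings", "women", "impact", "divorce", "crime",
    "media", "weird news", "green", "worldpost", "religion", "style",
    "science", "world news", "taste", "tech", "money", "arts", "fifty",
    "good news", "arts & culture", "environment", "college", "latino voices",
    "culture & arts", "education",
]

# Same merges as the chained replace() calls in the cell above.
MERGE = {
    "the worldpost": "world news",
    "worldpost": "world news",
    "black voices": "diverse voices",
    "queer voices": "diverse voices",
    "latino voices": "diverse voices",
    "women": "diverse voices",
}

def to_target(category):
    """Map a raw category to its merged target label."""
    return MERGE.get(category, category)

targets = {to_target(c) for c in CATEGORIES}
print(len(CATEGORIES), "->", len(targets))
# prints: 41 -> 36
```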

In [38]:
df_2.drop(["authors", "link", "category", "date", "full_text"], axis = 1, inplace = True)

In [39]:
df_2

Unnamed: 0,headline,short_description,lower_text,target
0,There Were 2 Mass Shootings In Texas Last Week...,She left her husband. He killed their children...,meichun jau huffpost amanda painter sole survi...,crime
1,Will Smith Joins Diplo And Nicky Jam For The 2...,Of course it has a song.,2018 fifa world cup starts june 14 russia offi...,entertainment
2,Hugh Grant Marries For The First Time At Age 57,The actor and his longtime girlfriend Anna Ebe...,love actually turned matrimony hugh grant 57ye...,entertainment
3,Jim Carrey Blasts 'Castrato' Adam Schiff And D...,The actor gives Dems an ass-kicking for not fi...,rep adam schiff dcalif fellow democrats better...,entertainment
4,Julianna Margulies Uses Donald Trump Poop Bags...,"The ""Dietland"" actress said using the bags is ...",dietland star told host jimmy fallon uses trum...,entertainment
...,...,...,...,...
40378,Book Publishers Are Scrambling To Release Trum...,Several forthcoming new books center on coping...,day donald trumps election win many americans ...,arts & culture
40379,Second NFL Player Targeted By Stomach-Turning ...,Broncos linebacker Brandon Marshall said he wa...,justin edmonds via getty images brandon marsha...,diverse voices
40380,Senate Democrats Give Up On Coal Miner Health ...,Sen. Joe Manchin said he hopes to enlist Presi...,coalstate senate democrats friday backed threa...,politics
40381,Courtesy Over Death Penalty Cases May Be Dead ...,The justices agreed to stay one man's executio...,days presidential election chief justice john ...,politics


Now that I had identified more stopwords, I built a multinomial Naive Bayes model and a random forest model, both scored on accuracy, using the text left over after the stopwords were removed. These models are relatively basic: they only take into account how often a word is used in each category of news, whereas a neural network can also take into account where a word appears in the text.

In [None]:
y = df_2["target"]
X = df_2["lower_text"]

In [41]:
stopwords_list = stopwords.words('english') + list(pictwitter) + list(http) + list(www)
idf_train, idf_test, y_tr, y_t, vectorizer = tfidf(X, y, stopwords_list)

In [42]:
def classify_text(classifier, tf_idf_train, tf_idf_test, y_train):
    classifier.fit(tf_idf_train, y_train)
    train_preds = classifier.predict(tf_idf_train)
    test_preds = classifier.predict(tf_idf_test)
    return train_preds, test_preds

def score_preds(y_test, y_train, test_preds, train_preds):
    print("Train: ", accuracy_score(y_train, train_preds))
    print("Test: ", accuracy_score(y_test, test_preds))

In [64]:
rfc = RandomForestClassifier()
nb_classifier = MultinomialNB()

In [44]:
nb_train_preds, nb_test_preds = classify_text(nb_classifier, idf_train, idf_test, y_tr)

In [45]:
score_preds(y_t, y_tr, nb_test_preds, nb_train_preds)

Train:  0.40972057578323456
Test:  0.3966673440357651


In [46]:
rf_train_preds, rf_test_preds = classify_text(rfc, idf_train, idf_test, y_tr)

score_preds(y_t, y_tr, rf_test_preds, rf_train_preds)

Train:  0.9969517358171042
Test:  0.6134931924405609


In [47]:
from sklearn.model_selection import GridSearchCV

In [65]:
param_grid = {
    "n_estimators": [100, 150],
    "criterion": ["gini", "entropy"],
    "min_samples_split": [2,4],
    "min_samples_leaf": [1,3]
}

In [69]:
grid = GridSearchCV(rfc, param_grid, scoring = "accuracy", cv = 2, verbose = True)

In [70]:
grid.fit(idf_train, y_tr)

Fitting 2 folds for each of 16 candidates, totalling 32 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  32 out of  32 | elapsed: 60.1min finished


GridSearchCV(cv=2, estimator=RandomForestClassifier(),
             param_grid={'criterion': ['gini', 'entropy'],
                         'min_samples_leaf': [1, 3],
                         'min_samples_split': [2, 4],
                         'n_estimators': [100, 150]},
             scoring='accuracy', verbose=True)

In [71]:
grid.best_params_

{'criterion': 'gini',
 'min_samples_leaf': 1,
 'min_samples_split': 4,
 'n_estimators': 100}
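
The "16 candidates, totalling 32 fits" reported by the grid search follows directly from the size of the grid: four parameters with two values each, evaluated on two folds. A small standard-library sketch that enumerates the same combinations:

```python
from itertools import product

param_grid = {
    "n_estimators": [100, 150],
    "criterion": ["gini", "entropy"],
    "min_samples_split": [2, 4],
    "min_samples_leaf": [1, 3],
}
cv_folds = 2

# Every combination of one value per parameter is one candidate model.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]
total_fits = len(candidates) * cv_folds

print(len(candidates), total_fits)
# prints: 16 32
```

Each added parameter value multiplies the number of fits, which is why even this small grid took about an hour when run sequentially. With the default `refit=True`, `GridSearchCV` also trains the best candidate once more on the full training set after the search.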

In [72]:
rfc_better = RandomForestClassifier(n_estimators = 100, min_samples_leaf = 1, min_samples_split = 4, criterion = "gini")

In [73]:
rf_train_preds_better, rf_test_preds_better = classify_text(rfc_better, idf_train, idf_test, y_tr)

score_preds(y_t, y_tr, rf_test_preds_better, rf_train_preds_better)

Train:  0.996917866215072
Test:  0.6127819548872181


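As we can see, with the data we had managed to collect at that time, the best test accuracy these models achieved was about 61 percent, from the random forest. Its training accuracy of about 99.7 percent shows that it overfits heavily, and the grid search did not improve on the default settings. Still, given that this was sorting into about 36 categories at the time, 61 percent is significantly more useful than a uniform random guess, which should have an accuracy of about 2.8 percent (1/36).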
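
The comparison against chance can be worked through in a few lines. The accuracies below are the random forest scores printed above, rounded to four decimals. The `majority_baseline` helper and its toy labels are hypothetical: on imbalanced data, always predicting the most common class is usually a stronger baseline than uniform guessing, but the class distribution of `df_2` is not shown in the notebook, so no real figure is computed for it here.

```python
from collections import Counter

n_classes = 36
uniform_guess = 1 / n_classes
print(f"uniform random guess: {uniform_guess:.3f}")

# Random forest accuracies reported above, rounded to four decimals.
rf_train, rf_test = 0.9970, 0.6135
print(f"train/test gap: {rf_train - rf_test:.2f}")
print(f"test accuracy is about {rf_test / uniform_guess:.0f} times the uniform guess")

def majority_baseline(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Toy labels for illustration only, not the notebook's data.
toy_labels = ["politics"] * 5 + ["wellness"] * 3 + ["crime"] * 2
print(majority_baseline(toy_labels))
```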