## NLP

#### Use `CountVectorizer` or `TfidfVectorizer` from scikit-learn to create features from the thread titles and descriptions (NOTE: Not all threads have a description)
- Examine using count or binary features in the model
- Re-evaluate your models using these. Does this improve the model performance? 
- What text features are the most valuable? 

In [22]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from nltk.corpus import stopwords
from bs4 import BeautifulSoup
import requests



In [2]:
aml_df = pd.read_csv('./ml')
finance_df = pd.read_csv('./personalfinance')

In [3]:
aml_df.isnull().sum()

Unnamed: 0      0
post          838
title           0
true_y          0
dtype: int64

In [4]:
aml_df = aml_df.replace(np.nan, '')

In [5]:
aml_df.isnull().sum()

Unnamed: 0    0
post          0
title         0
true_y        0
dtype: int64

In [6]:
del aml_df['Unnamed: 0']

In [7]:
aml_df.head()

Unnamed: 0,post,title,true_y
0,"Hey everyone, I just took the Cams test earlie...",Passed my CAMS!,r/moneylaundering
1,Looking forward to going into the AML route. P...,ACAMS 2nd edition study guide,r/moneylaundering
2,Can anyone pls share with me good websites for...,Sample questions bank for cams,r/moneylaundering
3,Whether there is any negative marking for cams...,Cams certification,r/moneylaundering
4,,BSA Data and SARS Value Emphasized by FinCEN D...,r/moneylaundering


In [8]:
finance_df.isnull().sum()

Unnamed: 0    0
post          8
title         0
true_y        0
dtype: int64

In [9]:
del finance_df['Unnamed: 0']

In [10]:
finance_df = finance_df.dropna()

In [11]:
concat_df = pd.concat([finance_df, aml_df], axis = 0, ignore_index= True)

In [12]:
concat_df.head()

Unnamed: 0,post,title,true_y
0,# 30-day challenges\n\nWe are pleased to conti...,"30-Day Challenge #8: Cook more often! (August,...",r/personalfinance
1,"\n### If you need help, please check the [PF W...","Weekday Help Thread for the week of August 27,...",r/personalfinance
2,I sold a gaming computer on ebay and the buyer...,[US] Paypal account balance is -$2000. What ca...,r/personalfinance
3,Hi all. 5 year lurker on reddit here. Like the...,27 YO in £7000 debt and can’t seem to make any...,r/personalfinance
4,"My fiancé borrowed $10,000 from One Main and u...",Fiancé used his vehicle for a loan. He has bad...,r/personalfinance


In [13]:
# clean post, title, and true_y (HTML ARTIFACTS)

artifact = ['\n\n', '\n', '\t', '#', r'r/', 'https://', 'www.', 'http://', 'reddit.com', 'wiki',
            '.', ',', '-',';','"', ':', "[", "[", "(", ")", '[/', '/', '?', "'", '*', '$', '&']

for i in artifact:
    concat_df['post'] = concat_df['post'].map(lambda x: x.replace(i, ''))
    concat_df['title'] = concat_df['title'].map(lambda x: x.replace(i, ''))
    concat_df['true_y'] = concat_df['true_y'].map(lambda x: x.replace(i, ''))

In [14]:
concat_df['interact'] = concat_df['post'] + concat_df['title']

# Since moneylaundering has 800+ missing threads,
# I decided to create an interact feature that combines threads with titles.

In [15]:
# clean stop words:

from nltk.corpus import stopwords
stop = stopwords.words('english')

concat_df['interact'] = concat_df['interact'].apply(lambda x: ' '.join([word for word in x.split(' ') if word not in (stop)]))

In [16]:
concat_df.head()

Unnamed: 0,post,title,true_y,interact
0,30day challengesWe are pleased to continue ou...,30Day Challenge 8 Cook more often! August 2018,personalfinance,30day challengesWe pleased continue 30day cha...
1,If you need help please check the PF Wiki]per...,Weekday Help Thread for the week of August 27 ...,personalfinance,If need help please check PF Wiki]personalfin...
2,I sold a gaming computer on ebay and the buyer...,US] Paypal account balance is 2000 What can th...,personalfinance,I sold gaming computer ebay buyer paid Paypal ...
3,Hi all 5 year lurker on reddit here Like the t...,27 YO in £7000 debt and can’t seem to make any...,personalfinance,Hi 5 year lurker reddit Like titles suggests c...
4,My fiancé borrowed 10000 from One Main and use...,Fiancé used his vehicle for a loan He has bad ...,personalfinance,My fiancé borrowed 10000 One Main used truck c...


## Predicting subreddit using Random Forests + Another Classifier

#### We want to predict a binary variable - class `0` for one of your subreddits and `1` for the other.

In [17]:
one_hot = pd.get_dummies(concat_df['true_y'])
del one_hot['personalfinance']

In [18]:
concat_df = concat_df.join(one_hot)
# moneylaundering feature added. (1 = ML) 
# I will drop true_y now to clean up my df.
del concat_df['true_y']

In [19]:
concat_df.tail()

Unnamed: 0,post,title,interact,moneylaundering
1913,,HSBC Hires Lloyds Compliance Executive for Eur...,HSBC Hires Lloyds Compliance Executive Europea...,1
1914,,Russian diplomat loses defamation fight with i...,Russian diplomat loses defamation fight invest...,1
1915,,Italian journalist threatened after helping to...,Italian journalist threatened helping dismantl...,1
1916,,Mahtanis Mystery Shareholder in Michigan,Mahtanis Mystery Shareholder Michigan,1
1917,,MENAFATF recognises Qatars efforts to combat m...,MENAFATF recognises Qatars efforts combat mone...,1


In [92]:
concat_df['string_features'] = concat_df['interact'].apply(lambda x: x[0])

In [94]:
concat_df['interact'].values

array([' 30day challengesWe pleased continue 30day challenge series Past challenges found here]personalfinance30daychallengesThis months 30day challenge Cook often! Two biggest budgetkillers see subreddit lots wasted money eating spending much groceries While everyones situation different want highlight steps help get started Planning half battle It easier cook home make plan week Just getting takeout becomes much tempting figure everything long day Things efficient done bulk Consider making enough leftovers Cooking several meals day also great technique Make use freezer ensure food doesnt go waste Try shop sales If watch ads learn often grocery stores cycle sale It might meat one week cheese next etc So figure cycle area stock up! Walmart offbrand curse words This one way stretch meal planning budget Walmarts price matching policy make buying ingredients one place easier If youre getting started cooking tend eat lot dont feel need jump straight planning entire week meals Leave days un

In [100]:
features = 'interact'
X = concat_df[features].values
y = concat_df['moneylaundering']

In [101]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [103]:
X_test

array(['North Korean leaders used Brazilian passports apply Western visas',
       '\x18White Label Money Laundering Services',
       'Is This American Spy Dead Or Was He Ever Real',
       'Corruption Kenya worse ever says veteran campaigner',
       'DBS Hires Standard Chartered’s Lam Chee Kin Head Compliance',
       'Im looking ways promote discussion make subreddit helpfulSo resources subreddit provide Id love hear ideas Openended weekly discussion threads Posts topics Posts fewer topics More posts Career resources FAQ Wiki pages Compliance calendadatebook Social media resources CAMS exam prepHow subreddit useful',
       'Senior Moldovan Judge’s Daughter Lived In Posh London Flat  With Tainted Money  OCCRP',
       'Venezuela vice president squeezes media bosses drugs story',
       'RollsRoyce Vows Compliance After Petrobras Bribe Accusation',
       'My credit good sure I qualify I already submitted financial application honda finance plan applying bank backupI incoming check 

In [85]:
# new_x_train_marks = []
# for entry in X_train:
#     new_x_train_marks.append(entry[0])

In [109]:
# vectorize interact feature:
vectorizer = CountVectorizer(analyzer = "word",   \
                             tokenizer = None,    \
                             preprocessor = None, \
                             stop_words = None,   \
                             lowercase = False,
                             max_features = 5000) 


vectorizer.fit(X_train)
X_train_vec = vectorizer.transform(X_train)

X_test_vec = vectorizer.transform(X_test)
#test_data_features = vectorizer.transform(X_test)

In [87]:
clf = RandomForestClassifier(n_jobs=2, random_state=0)

In [88]:
X_train_vec.shape

(1438, 5000)

In [143]:
feature_names = pd.DataFrame(X_train_vec.todense(), columns=vectorizer.get_feature_names()).head()

In [159]:
zip(feature_names.columns[2399])

'dont'

In [116]:
clf.fit(X_train_vec.todense(), y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=2,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

In [128]:
# Apply Classifier To Test Data

pred = clf.predict(X_test_vec)

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1,
       1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0,
       0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1,
       0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0,
       1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1,
       0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1,
       0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1,
       0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0,
       0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0,
       1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0,
       0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1,

In [133]:
# View the predicted probabilities of the first 10 observations
clf.predict_proba(X_test_vec)[:10]

array([[0. , 1. ],
       [0. , 1. ],
       [0. , 1. ],
       [0. , 1. ],
       [0. , 1. ],
       [0.2, 0.8],
       [0. , 1. ],
       [0. , 1. ],
       [0. , 1. ],
       [0.8, 0.2]])

In [142]:
np.argsort(clf.feature_importances_)[::-1]

array([ 729, 4549, 2399, ..., 3113, 3112,    0])

In [137]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, pred)

array([[236,   3],
       [ 19, 222]])

#### Thought experiment: What is the baseline accuracy for this model?

#### Create a `RandomForestClassifier` model to predict which subreddit a given post belongs to.

In [None]:
# created above.


#### Use cross-validation in scikit-learn to evaluate the model above. 
- Evaluate the accuracy of the model, as well as any other metrics you feel are appropriate. 
- **Bonus**: Use `GridSearchCV` with `Pipeline` to optimize your `CountVectorizer`/`TfidfVectorizer` and classification model.

#### Repeat the model-building process using a different classifier (e.g. `MultinomialNB`, `LogisticRegression`, etc)

# Executive Summary
---
Put your executive summary in a Markdown cell below.