<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Reddit - Evaluation and Conclusion

--- 
# Notebook 5

The fifth notebook will include evaluation of the production model, conclusion and the recommendations.

---

# 0. Import Package

In [6]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from nltk.tokenize import word_tokenize, WhitespaceTokenizer
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.corpus import stopwords

# Import model
from sklearn.naive_bayes import MultinomialNB

# Import Evaluations
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import text

# 1.1 Evaluation - Baseline Score


The baseline score we are using to evaluate this model is accuracy. The solution to our problem statement is to pull out the top 20 features of each reddit to identify interest of cat and dog owners in order to for our stakeholder to push out new business ventures for this group of audience. By having a high accuracy, the features identified will be relevant to the audience and insightful solutions can be provided.

# 1.2 Evaulation - Relevant Metrics


Four possible relevant metrics for our production model:
* Accuracy
* Sensitivity
* Specificity
* Precision

Accuracy measures how high the true positive and true negatives is. When the accuracy is high, the top features for each subreddit will be from the correct subreddit group and we can then correctly draw our insights from thus this metrics will be our baseline score.

Sensitivity measures how much of the positives are true positives. Since we are not looking at only one class but both classes of both cats and dogs, this metrics will not be able to give us a true success of our production model.

Specificity like sensitivity measures only one class. Specificity measures the amount of true negatives among all the negatives. The reason to reject this metrics as a baseline score is the same of rejecting sensitivity.

Precision measures the ability of a classification model to identify only the relevant data points. It is calculated how much of all the classified positives are true positives. This is a better evaluation metrics as compare to sensitivity and specificity but it still only measure one class thus is not the best evaluation metric for us.

# 1.3 Evaulation - Interpretation of Model Results

In [9]:
# Original list of english stop words in SkLearn Library
text.ENGLISH_STOP_WORDS

# Create list for new stop words
add_stop_words = ['im', 'does']

# Joining new list of stop words to list in SKLearn Library
stop_words = text.ENGLISH_STOP_WORDS.union(add_stop_words)

# Print to check
#stop_words

In [10]:
# Create a function to find the train and test accuracy scores of the models with 5 fold cross validation
def report_error(model, X1, y1, X2, y2):
    model.fit(X1, y1)
    print('The accuracy train score for', model, 'is', cross_val_score(model, X1, y1, cv=5).mean(),'.')
    print('The accuracy test score for',model, 'is', cross_val_score(model, X2, y2, cv=5).mean(),'.')
    print()

In [12]:
# Reloading the data and implement what we found above.
# Stemming
df_model = pd.read_csv('./dataset/cleaned.csv') 

# Instantiate PorterStemmer
porter_stemmer = PorterStemmer()

# Create function to stem
def stem_sentences(sentence):
    tokens = sentence.split()
    stemmed_tokens = [porter_stemmer.stem(token) for token in tokens]
    return ' '.join(stemmed_tokens)

df_model['title'] = pd.DataFrame(df_model['title'].apply(stem_sentences))

# Count Vectoriser with the additional stop words
X = df_model['title']

# Instantiate a CountVectorizer with the new added stop words and min df = 3 and ngram range from 1 to 3.
cvec = CountVectorizer(stop_words = stop_words, min_df = 1, max_df = 0.35,  ngram_range=(1,1), max_features = 3500)

# Fit the vectorizer on our corpus and transform it.
X_vec = cvec.fit_transform(X)

# Create a dataframe after count vectoriser
df_vec_model = pd.DataFrame(X_vec.todense(), columns = cvec.get_feature_names())

# Remove the subreddit columns and add back the original columns
df_vec_model.drop(columns =['subreddit'], inplace = True)
df_vec_model = pd.concat([df_vec_model, df_model[['subreddit', 'emoji', 'word_count', 'title_length']]], axis=1)

# Define target variable
y = df_vec_model.pop('subreddit')
X = df_vec_model

# Instantiate Multinomial Naive Bayes
nb = MultinomialNB()

# Redefine training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

report_error(nb, X_train, y_train, X_test, y_test)

The accuracy train score for MultinomialNB() is 0.9228375873885936 .
The accuracy test score for MultinomialNB() is 0.9112072605889174 .



In [13]:
def get_salient_words(nb_clf, vect, class_ind):
    words = cvec.get_feature_names()
    zipped = list(zip(words, nb_clf.feature_log_prob_[class_ind]))
    sorted_zip = sorted(zipped, key=lambda t: t[1], reverse=True)
    return sorted_zip

neg_salient_top_20 = get_salient_words(nb, cvec, 0)[:20]
pos_salient_top_20 = get_salient_words(nb, cvec, 1)[:20]

pos_salient_top_20

[('dog', -4.544803418066204),
 ('help', -6.589812763324556),
 ('puppi', -6.793908119668071),
 ('breed', -7.214582646439764),
 ('need', -7.267692471753713),
 ('old', -7.279887744847531),
 ('ha', -7.295344003084223),
 ('advic', -7.386612517197171),
 ('theyv', -7.428430456880493),
 ('doe', -7.561453590281876),
 ('food', -7.598724985079107),
 ('new', -7.598724985079107),
 ('trail', -7.6731575798618765),
 ('wherev', -7.710198851542226),
 ('anyon', -7.763480218155163),
 ('ye', -7.809289754186457),
 ('ani', -7.814511698167609),
 ('like', -7.814511698167609),
 ('just', -7.846429301135914),
 ('look', -7.913493531716459)]

In [14]:
neg_salient_top_20

[('cat', -5.237643947111158),
 ('theyv', -6.164994532451706),
 ('zuko', -6.484474313936721),
 ('new', -6.983244086010304),
 ('ye', -6.989461722621175),
 ('hi', -7.058306104984681),
 ('love', -7.062770398113367),
 ('like', -7.122696159044555),
 ('just', -7.168876904107915),
 ('vs', -7.276451034394055),
 ('kitten', -7.307395838243477),
 ('littl', -7.4582187279780605),
 ('ha', -7.498902304614504),
 ('look', -7.563210171847235),
 ('kitti', -7.631939503999416),
 ('day', -7.6559406560989585),
 ('boy', -7.710007877369234),
 ('old', -7.718591621060626),
 ('help', -7.72724968380374),
 ('babi', -7.762651610854657)]

Our final model achieves an accuracy of higher than 90% accuracy. This gives us the confidence that the word features extracted are reliable to derive insights from.

The top 20 word features do give us an idea of the interest of cat and dog owners.

# 1.4 Evaulation - Domain Knowledge


Appying domain knowledge for each subreddit by gaining insights from the word features.

**Dog Features**
* Puppi - Puppies are ranked 3rd in importance which shows that most dog owners own new born dogs. At this stage, much care is required. A care service can be provided to cater to dogs of a certain young age or training for dog owners to raise their puppies.
* Breed - There are many breeds in dogs. A premium service can be formed to search for exquisite dog breeds to sell to owners who are interested to pay a high price.
* help/need/advice - This 3 keywords implies that dog owners meet a lot of problems with their dogs. A paid hotline can be setup to answer and address these issues.
* trail - This interesting word tells us that dog owners like to walk their dogs and are looking for trails to do that. An interest group can be created where trails with beautiful scenaries can be recommended.
* food - This word features suggest that dog owners are concern about the overall health of dogs and want to feed them properly.

**Cat Features**
* love/like - This 2 features can be interpreted 2 ways. First it can be the expression of cat owners to their cats. As seen during EDA, cat owners tend to want to dress up their cats more. They are more expressive and post a lot of pictures about their cats. Secondly, it can be intrepreted as what their cats love. A grooming service can be setup for cat owners to better express their love for their cats by dressing and grooming them. It is also possible to open a toy and food shop to sell things that cats love.
* kitten - Similar to the above, the word kitten means that a lot of cat owners own their cat when they are young and similar services can be created to baby cats.
* vs - This word feature tells us that cat owners like to compare. When setting up e-commerce shop for cats, it is good to include a comparision feature for the stuff they sell.
* new - This word features suggest that cat owners like to buy new stuff for their cats. While dog owners are more concern about the needs of their dogs, cat owners like to spend on their cats by buying new stuff for them.

# 1.5 Evaulation - Descriptive and Inferential Statistics

**Descriptive**
The following were derived during cleaning and EDA:
* Cat subreddits post much higher emoji and pictures than dog owners.
* Dog subreddits have higher word_count as compare to cat owners.

**Inferential**
* The features drawn all have negative coefficients as Naive Bayers is one directional. This however does not prevent us from understanding the features.

# 2.1 Conclusion - Overall Project

Each of the following steps has help us reach the solution to our problem statement:
* Data collection help us understand the main difference between cat and dog owners in that cat owners like to post pictures. this helps gives us confidence when deriving the recomeendations at the end.
* Date Cleaning help us organise the data better and to create new features for the model such as emoji and word count. This helps improve the overall accuracy scores of our production model after optimisation.
* Preprocessing helps us reduce the features through stemming and addition of stopwords. This helps improve the model in reaching higher accuracy.
* Model Selection helps us find the best model after fitting 7 classification models and optimising the top 2. This then led us to the top features to give recommendations to our stakeholders.

# 2.2 Conclusion - How to obtain Recommendations?

* Refer to 1.4 above

# 2.3 Conclusion - Final Recommendation

The following improvements can be made for our model:
* For modeling rigor, literal subreddit name references should have been removed as individual words (e.g. "dog", "cat") were left in earlier on.
* Having posts with more unique words helps to distinguish between spam and ham but;
* Not all posts with unique words are relevant, so we have to focus more on the coefficient of frequently appearing words
* Perhaps this project could be improved by performing a form of sentiment analysis on the data in future studies. Lastly, though this model scored only around ~ 90% accuracy on testing data, this model still predicts much better than the baseline (~ 50.6%) and might be helpful in differentiating between dogs and cats.
* Further analysis can be done on the emojis within the titles.

# 2.4 Conclusion -  Does Conclusion Answers Problem Statement?

The production model hits the accuracy of 90% and the features gave us business recommendations. This is the 2 targets we set out to achieve at the start of the project thus our problem statement is address.

# 2.5 Conclusion - Benefits for Stakeholders

The benefits to stakeholders are 2 fold:
* 1. The production model can be use constantly to understand the trends of dogs and cat owners.
* 2. A variety of business recommendations are provided in 1.4 above for possible business ventures.

# 2.6 Conclusion - Future Steps

Moving forward:
* The same production model can be applied to other subreddit to obtain the interest of other types of pet owners.
* Explore other forums that are more local e.g. Hardwarezone because Reddit posts are more global and may not attune well to local context even though CatDog wants to reach out to an international clientele. Starting small within the local context would be a better way to 'taste' the market first.