## EDA
This notebook walks through the EDA process I followed when first evaluating the tweets. It was cleaned up at the end, but several comments are kept to preserve my approach. A summary of these steps can be found in the technical report notebook.

--- 

**Library Imports**

In [1]:
# the classics
import scipy.sparse as sparse
import itertools
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.patches import Patch
%matplotlib inline

# import some pre-written snippets that might be useful.
from wbcustom.roc import plot as roc_plot
#from wbcustom.notebook_analysis import notebook_analysis as notebook

# some specific data manipulation tools from sklearn
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# a few models to try at first
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier

# model evaluation tools
from sklearn.metrics import confusion_matrix, roc_auc_score, classification_report

from nltk.stem import WordNetLemmatizer, SnowballStemmer, PorterStemmer

**Reading in the Data**

In [3]:
# read in all of the tweets
tweets = pd.read_csv('./assets/tweets.csv')

FileNotFoundError: File b'./assets/tweets.csv' does not exist

**Initial Inspection**

In [None]:
# quick look at what I'm dealing with
tweets.info()
tweets.head()

22,000 tweets, 4 columns in total:
 - post_id: a unique post id (not the numeric status id from Twitter)
 - text: the content of the tweet
 - date_posted: tweet's time-stamp (not in datetime format)
 - questionable_content: target column
 
We see that a few of our data points are missing their original content. Upon first inspection, I was hoping to retrieve the missing data using `python-twitter` and the post_id column, but the values are string identifiers, and not the numeric tweet ids. Next time...  

Let's check the balance of our target class.

In [None]:
cb = tweets['questionable_content'].value_counts()

print('Counts', cb.values)
print('Composition', cb.values/len(tweets))

2,902 tweets are marked as questionable, which is about 13% of the data. We will need to keep this slight imbalance in mind during modeling, but for now let's take a look at the 70 tweets that are missing text.
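For reference, the counts and the composition of the target can also be computed in one step with `value_counts(normalize=True)`. A minimal sketch on a toy column with a 13% positive class (the `toy` frame is made up for illustration and is not the real tweets data):

```python
import pandas as pd

# toy stand-in for the target column: 1 = questionable, 0 = not questionable
toy = pd.DataFrame({'questionable_content': [0] * 87 + [1] * 13})

# raw counts per class, and the same counts as a share of all rows
counts = toy['questionable_content'].value_counts()
shares = toy['questionable_content'].value_counts(normalize=True)

# line the two up side by side for a quick read of the class balance
balance = pd.DataFrame({'count': counts, 'share': shares})
print(balance)
```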

**Addressing the Missing Data**

In [None]:
# gather tweets with null values in the text column
null = tweets[tweets['text'].isnull()]

null.head()

In [None]:
# what are the labels of the missing tweets?
null['questionable_content'].value_counts()

Some of them are labeled as questionable, but we can't hope to figure out why without the tweet's text or the ability to look up the original tweet itself. Dropping these tweets removes only 0.3% of our data, so I will go ahead and drop them.

In [None]:
# drop tweets with any null values
tweets = tweets.dropna().reset_index(drop=True)

**Quick Look at `date_posted`**

In [None]:
# convert the column to a datetime object, rename it to timestamp
try:
    tweets['timestamp'] = pd.to_datetime(tweets['date_posted'])    
    tweets = tweets.drop(columns='date_posted')
except KeyError:
    pass

# create columns with each time scale value
timestamp = pd.DataFrame({
    'hour': tweets['timestamp'].dt.hour,
    'day': tweets['timestamp'].dt.day,
    'month': tweets['timestamp'].dt.month,
    'year': tweets['timestamp'].dt.year,
    'day_of_week': tweets['timestamp'].dt.dayofweek,
    # .dt.weekofyear was removed in pandas 2.0; isocalendar().week is the replacement
    'week_of_year': tweets['timestamp'].dt.isocalendar().week
})

# add this new info to the data
tweets = pd.concat(objs=[tweets, timestamp], axis = 1)

Let's take a glance at the new columns

In [None]:
tweets.describe()

In [None]:
set(tweets['year'])

Some tweets are stamped with a 1970 post date. This is obviously an error, as Twitter wasn't created until 2006. Using the year 1970 as a mask, we can dig into these erroneous timestamps.

In [None]:
tweets[tweets['year'] == 1970].describe()

It looks like these years are mislabeled, but the rest of the timestamp is ok. Knowing there are errors, we will avoid using the year in the model. Now let's plot the count and questionable percentage of tweets on each time scale to look for any trends.

In [None]:
for column in ['hour', 'day', 'month', 'year', 'day_of_week', 'week_of_year']:
    
    # group tweets by different specified time scale
    groupings = tweets.groupby(by=[tweets[column]])

    # create the variables to plot: tweet counts, questionable content percentages, and time intervals
    y1 = list(groupings['questionable_content'].count())
    y2 = list(groupings['questionable_content'].sum()*100/groupings['questionable_content'].count())
    x = list(set(groupings.keys[0]))
    
    df = pd.DataFrame({
        'index': x,
        'count': y1,
        'questionable percentage': y2
    })
    
    # instantiate plot
    width= .4
    fig = plt.figure(figsize=(12,4)) # create figure
    
    # create the stacked subplots
    ax = fig.add_subplot(111) 
    ax2 = ax.twinx() # copy for side by side bars

    # use pandas to populate subplots
    df['count'].plot(kind='bar', color='red', ax=ax, width=width, position=1)
    df['questionable percentage'].plot(kind='bar', color='blue', ax=ax2, width=width, position=0)
    
    # set label for the first plot; tick color matches the red count bars
    ax.set_ylabel('Tweet Count')
    for tl in ax.get_yticklabels():
        tl.set_color('r')
    
    # again for the second plot; tick color matches the blue percentage bars
    ax2.set_ylabel('Percent Questionable (%)')
    for tl in ax2.get_yticklabels():
        tl.set_color('b')
        
    # display final viz
    plt.title(f'Tweet breakdown by {column.capitalize()}')

These charts need to be cleaned up (specifically the left spacing and the xticks for the by-year chart), but they give us a quick look at the different trends. If these charts were required in production, I would likely use Tableau to add tooltips.

Some thoughts from looking at these plots:
- There appears to be a visual pattern in the hourly and day_of_week breakdowns. 
- There is an influx of tweets earlier in the day, with a lower proportion of questionable content being posted during that time. 
- Additionally, the percentage of questionable content has dropped off in the past two years despite higher tweet counts.

Ultimately, we would need to dive into the data collection methods, the timestamps relative to each user's timezone, etc., before this information would be more useful to us.

In [None]:
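# restrict to numeric/boolean columns; newer pandas raises on text columns in .corr()
sns.heatmap(tweets.select_dtypes(include=['number', 'bool']).corr())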

There are no real correlations between the different time scales and the label.

**Baseline Model**

Define the X and y variables and split them into training and testing sets. The split is stratified, so it maintains the class balance in the y variable.

In [None]:
X = tweets['text']
y = tweets['questionable_content']

# stratify on y variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .25, stratify = y, random_state = 23)
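As a quick sanity check on what `stratify` does, the sketch below runs the same kind of split on a toy label array with a 13% positive class and compares the positive share in each half. The arrays `X_toy` and `y_toy` are hypothetical stand-ins, not the tweets data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy stand-in: 1,000 samples with a 13% positive class
y_toy = np.array([1] * 130 + [0] * 870)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# same split settings as the cell above, applied to the toy arrays
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=.25, stratify=y_toy, random_state=23
)

# with stratification, both shares should sit close to the overall 13%
print('train share:', y_tr.mean())
print('test share: ', y_te.mean())
```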

Instantiate a stock count vectorizer to build a simple model based on word counts.

In [None]:
cvec = CountVectorizer()

X_train_cvec = cvec.fit_transform(X_train)
X_test_cvec = cvec.transform(X_test)
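To make the fit/transform distinction concrete, here is a minimal sketch on three made-up sentences (the documents and the `toy_cvec` name are illustrative only). The vocabulary is learned from the training text alone, so words that only appear in the test text are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# made-up documents, for illustration only
train_docs = ['good morning twitter', 'good good vibes']
test_docs = ['morning vibes from a new user']

toy_cvec = CountVectorizer()
train_counts = toy_cvec.fit_transform(train_docs)  # learns the vocabulary
test_counts = toy_cvec.transform(test_docs)        # reuses it, drops unseen words

# columns are ordered alphabetically by vocabulary term
vocab = sorted(toy_cvec.vocabulary_)
print(vocab)
print(train_counts.toarray())
print(test_counts.toarray())
```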

In [None]:
# create a dataframe out of the vectorized test data
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
vec_feats = pd.DataFrame(X_test_cvec.toarray(), columns = cvec.get_feature_names_out())

# inspect
vec_feats.head()

We will want to scrub the non-English tweets later on as well.

Let's create a few different models and see how they do. Use the `%%time` magic command to get some efficiency info.

In [None]:
%%time
rfc = RandomForestClassifier(random_state = 23, n_estimators=50)
rfc.fit(X_train_cvec, y_train)

In [None]:
print('Train Score: ', rfc.score(X_train_cvec, y_train))
print('Test Score: ', rfc.score(X_test_cvec, y_test))

After 2 seconds, we get a fairly 'accurate' model, but we aren't concerned with accuracy. Since we are trying to limit false negatives, we want to maximize recall. Let's dig deeper.  

In [None]:
# use the classification report function for more classification metrics
print(classification_report(y_test, rfc.predict(X_test_cvec)))
print('---')
confusion_matrix(y_test, rfc.predict(X_test_cvec))

A recall of .57 means we are only correctly identifying questionable tweets 57% of the time. We can do better. As we see in the confusion matrix, 315 of the 725 questionable tweets in the test set were incorrectly labeled as not questionable.
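The recall figure follows directly from the two confusion matrix numbers quoted above. A short worked check (the variable names are just for illustration):

```python
# numbers quoted from the confusion matrix above
actual_questionable = 725   # questionable tweets in the test set
false_negatives = 315       # questionable tweets predicted as not questionable

# whatever was not missed was caught
true_positives = actual_questionable - false_negatives

# recall = TP / (TP + FN)
recall = true_positives / (true_positives + false_negatives)
print(true_positives, round(recall, 2))
```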

Let's see what our model is using to predict.

In [None]:
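# get feature importances from the model, plot the top 20
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
pd.DataFrame(
    rfc.feature_importances_, 
    index=cvec.get_feature_names_out()
).sort_values(by=0, ascending=False).head(20).plot(kind='barh', figsize=(12,7))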

The presence of common curse words and racial slurs is a good indicator of a questionable tweet. Now let's try a boosted model, to see if we can avoid the RFC's overfitting and improve sensitivity.

In [None]:
%%time
abc = AdaBoostClassifier(n_estimators=50)
abc.fit(X_train_cvec, y_train)

In [None]:
print(classification_report(y_test, abc.predict(X_test_cvec)))
print('---')
confusion_matrix(y_test, abc.predict(X_test_cvec))

Accuracy is better and recall is the same, while fit time was about 1/10th of the RandomForest's.

In [None]:
# top 20 feature importances for the boosted model
pd.DataFrame(
    abc.feature_importances_,
    index=cvec.get_feature_names_out()
).sort_values(by=0, ascending=False).head(20).plot(kind='barh', figsize=(12,7))

The important features are similar, and we will use this info for our feature engineering. We will now plot a ROC curve to check whether adjusting our probability threshold can help us tune.

In [None]:
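# custom roc curve function (need to add auc score)
roc_plot(abc.predict_proba(X_test_cvec), y_test)
# roc_auc_score expects (y_true, y_score); score with the probability of the
# positive class (assumes column 1 of predict_proba is the questionable class)
roc_auc_score(y_test, abc.predict_proba(X_test_cvec)[:, 1])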

This curve shows that we have a steep false positive slope, meaning we will get dramatically more false positives as we tune the probability threshold for recall (sensitivity). In our context, we don't mind having more tweets to review if it means fewer get through the filter.
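The tradeoff can be sketched by applying a custom cutoff to predicted probabilities instead of relying on the default of .5. The labels and probabilities below are made up for illustration; with the real model, `proba` would come from something like `abc.predict_proba(X_test_cvec)[:, 1]`, assuming column 1 is the questionable class.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# made-up labels and positive-class probabilities, for illustration only
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
proba = np.array([.05, .10, .20, .35, .45, .60, .30, .40, .55, .90])

results = {}
for threshold in (.5, .25):
    # flag a tweet as questionable when its probability clears the threshold
    y_pred = (proba >= threshold).astype(int)
    results[threshold] = {
        'recall': recall_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred),
        'false_positives': int(((y_pred == 1) & (y_true == 0)).sum()),
    }
    print(threshold, results[threshold])
```

On this toy data, lowering the threshold catches every positive at the cost of more false positives, which is the same trade the paragraph above describes.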

Now that we have a grip on what is going on, let's dive into some [feature engineering and modeling](./Data-Prep-and-Feature-Extraction.ipynb).