# Questionable Tweet Classification


### Binary Classification Using NLP

Wesley Bosse, Oct. 30th

---

### Problem:

The number of tweets sent every second is estimated to be between 6,000 and 10,000. Moderating them by hand would take an extraordinary amount of manpower. A binary classification model tuned to correctly identify questionable content could cut this work down tremendously.

I am approaching this problem as someone building a proof-of-concept model for moderating tweets posted from a network of high-profile accounts. The goal is to identify as many of the questionable tweets as possible without resorting to a scorched-earth policy that flags everything. I will move end to end without spending too much time in any one area. This way, I can hopefully come up with a decent working model and plenty of ideas for how to iterate.

### Data:

I was provided with 22,000 labeled tweets, nearly 3,000 of which were marked as questionable, so the classes are heavily imbalanced.

DATA DICTIONARY:
 - `post_id`: an alphanumeric string identifying the post; it is not the original Twitter ID
 - `text`: a string of the tweet content
 - `date_posted`: the timestamp on the tweet, as a string
 - `questionable_content`: a binary label for each tweet

---

### Data Cleaning

There were 70 tweets missing their text, which were dropped. The Snowball and Porter stemmers were tested throughout the process, as well as the WordNet lemmatizer.

The cleaning that still needs to be done:

- Cleaning up retweets and mentions
- Removing numbers
- Removing non-English text
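
The remaining cleaning steps could be sketched roughly as follows. This is a minimal sketch, not code from the notebook: the regular expressions are my assumptions about how retweets and mentions appear in the text, and `looks_english` is a hypothetical, crude ASCII-ratio heuristic that only catches non-Latin scripts. A proper language-identification library would be a better fit for real use.

```python
import re

# Assumed patterns: a leading "RT @user:" prefix, @mentions, and digit runs.
RETWEET_RE = re.compile(r"^\s*RT\s+@\w+:?\s*", flags=re.IGNORECASE)
MENTION_RE = re.compile(r"@\w+")
NUMBER_RE = re.compile(r"\d+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Strip the retweet prefix, mentions, and numbers, then tidy whitespace."""
    text = RETWEET_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    text = NUMBER_RE.sub("", text)
    return WHITESPACE_RE.sub(" ", text).strip()


def looks_english(text: str, min_ascii_ratio: float = 0.9) -> bool:
    """Hypothetical heuristic: treat mostly-ASCII text as English."""
    if not text:
        return False
    ascii_chars = sum(1 for ch in text if ch.isascii())
    return ascii_chars / len(text) >= min_ascii_ratio


cleaned = clean_tweet("RT @user: check this 123 out @other")
```

The `min_ascii_ratio` cutoff is arbitrary and would need tuning, since emoji-heavy English tweets could fall below it.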
    
There were several performance bottlenecks in this notebook, one of which I believe could be solved by using scikit-learn's `HashingVectorizer`, especially if this process were applied to a larger data set.
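
A minimal sketch of what swapping in a hashing vectorizer might look like. The toy tweets and the parameter values (`n_features`, `ngram_range`) are assumptions for illustration, not settings from the notebook.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Toy stand-ins for the cleaned tweet text.
tweets = ["this is a harmless tweet", "another perfectly fine tweet"]

# HashingVectorizer is stateless: it keeps no vocabulary in memory,
# so there is nothing to fit and memory use stays flat as data grows.
vectorizer = HashingVectorizer(
    n_features=2**18,       # assumed size of the hashed feature space
    ngram_range=(1, 2),     # unigrams and bigrams
    alternate_sign=False,   # keep values non-negative
    stop_words="english",
)
X = vectorizer.transform(tweets)  # sparse matrix, one row per tweet
```

The trade-off is that hashed columns cannot be mapped back to words, so the kind of unigram inspection done during EDA would still need a `CountVectorizer`.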

### Exploratory Data Analysis

The code file is [here]().

During my EDA, I completed the following steps:

- Check target class balance and data set size
- Clean up missing data
- Re-format and investigate the `date_posted` column
- CountVectorize and model as EDA: look at unigrams to see which words are important to off-the-shelf classification
- Dig for words by eliminating features via stemming/lemmatizing
- Plot the ROC curve to investigate how probability thresholds could be leveraged
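
To illustrate the last step, here is a small sketch of how a probability threshold could be read off the ROC curve to hit a target recall. The helper `threshold_for_recall` and the toy labels and probabilities are hypothetical; in practice `probs` would come from something like `model.predict_proba(X)[:, 1]`.

```python
import numpy as np
from sklearn.metrics import roc_curve


def threshold_for_recall(y_true, probs, target_recall):
    """Return the highest threshold whose recall (TPR) meets the target,
    together with the false positive rate paid for it."""
    fpr, tpr, thresholds = roc_curve(y_true, probs, drop_intermediate=False)
    # thresholds are sorted high to low, so the first hit is the strictest one
    idx = int(np.argmax(tpr >= target_recall))
    return float(thresholds[idx]), float(fpr[idx])


# Toy example: two clean tweets (0) and two questionable tweets (1).
y_true = np.array([0, 0, 1, 1])
probs = np.array([0.1, 0.4, 0.35, 0.8])

threshold, false_positive_rate = threshold_for_recall(y_true, probs, target_recall=1.0)
y_pred = (probs >= threshold).astype(int)  # flag everything at or above the threshold
```

Lowering the threshold this way trades extra false positives for fewer missed questionable tweets, which matches the priority on recall.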

The EDA models had good accuracies, but with classes this imbalanced accuracy is easy to come by, and recall scores were hovering around 57%. Given the context we are operating in, where a missed questionable tweet matters more than a false alarm, we are more concerned with optimizing recall than accuracy. Because of this, and for the sake of performance, I will under-sample the majority class before feature engineering.
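
A minimal sketch of random under-sampling with pandas, assuming the data lives in a DataFrame with the `questionable_content` label from the data dictionary. The function name `undersample` and the toy data are hypothetical.

```python
import pandas as pd


def undersample(df: pd.DataFrame, target: str = "questionable_content",
                random_state: int = 42) -> pd.DataFrame:
    """Randomly down-sample every class to the size of the smallest class."""
    n_minority = df[target].value_counts().min()
    balanced = df.groupby(target).sample(n=n_minority, random_state=random_state)
    # Shuffle so the classes are not grouped together in row order.
    return balanced.sample(frac=1, random_state=random_state).reset_index(drop=True)


# Toy data: six clean tweets and two questionable ones.
toy = pd.DataFrame({
    "text": [f"tweet {i}" for i in range(8)],
    "questionable_content": [0, 0, 0, 0, 0, 0, 1, 1],
})
balanced = undersample(toy)
```

One design consideration: under-sampling only the training split would leave the test set with the original class balance, which may give a more realistic picture of precision.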

---

### Feature Engineering

The code file is [here]().

Visual inspection revealed that common curse words, racial slurs, and sexually explicit phrases were all strong indicators of questionable content. I would like to build separate word lists and create one feature per kind of questionable content, but for the sake of time I used a single word list, found online and aggregated with words from the EDA results. The list was used to create a single `contains_nsfw` column. To reduce collinearity and let the model dig deeper, I also removed the columns for the flagged words from the data set once the rows had been labeled with `contains_nsfw`.
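
The flagging step could be sketched like this. The word list here is a harmless placeholder, not the actual list, and the helper names `add_contains_nsfw` and `drop_flagged_columns` are hypothetical.

```python
import re

import pandas as pd

# Placeholder vocabulary; the real list came from online sources and EDA.
NSFW_WORDS = ["badword", "slurword"]

# Whole-word, case-insensitive match on any listed word.
NSFW_PATTERN = r"\b(?:" + "|".join(re.escape(w) for w in NSFW_WORDS) + r")\b"


def add_contains_nsfw(df: pd.DataFrame, text_col: str = "text") -> pd.DataFrame:
    """Add a binary `contains_nsfw` column based on the word list."""
    out = df.copy()
    hits = out[text_col].str.contains(NSFW_PATTERN, case=False, regex=True, na=False)
    out["contains_nsfw"] = hits.astype(int)
    return out


def drop_flagged_columns(bow: pd.DataFrame, words=NSFW_WORDS) -> pd.DataFrame:
    """Remove bag-of-words columns for words already covered by the flag."""
    flagged_words = set(words)
    flagged = [col for col in bow.columns if col in flagged_words]
    return bow.drop(columns=flagged)


tweets = pd.DataFrame({"text": ["this has a Badword here", "clean tweet"]})
flagged_tweets = add_contains_nsfw(tweets)
```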


Some low-hanging fruit to include was the set of VADER polarity scores: positive, neutral, negative, and compound. Sentiment could also be used in interaction terms with the presence of certain vocabulary, but further work is required to come up with a solid implementation of that.


### Modeling

The code file is [here]().

Based on the EDA and initial performance tests, I stuck with the `AdaBoostClassifier`. I grid-searched over a few base estimators, `n_estimators`, and a handful of other hyperparameters.
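
A minimal sketch of such a grid search, scored on recall. The synthetic data from `make_classification` and every value in `param_grid` are placeholders for illustration, not the grid actually used. The base-estimator parameter is named `estimator` in recent scikit-learn releases and `base_estimator` in older ones, so the sketch looks the name up at runtime.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the engineered feature matrix and labels.
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# Pick whichever name this scikit-learn version uses for the base estimator.
base_key = "estimator" if "estimator" in AdaBoostClassifier().get_params() else "base_estimator"

param_grid = {
    base_key: [DecisionTreeClassifier(max_depth=1), DecisionTreeClassifier(max_depth=2)],
    "n_estimators": [25, 50],
    "learning_rate": [0.5, 1.0],
}

search = GridSearchCV(
    AdaBoostClassifier(random_state=42),
    param_grid,
    scoring="recall",  # optimize for catching questionable tweets
    cv=3,
)
search.fit(X, y)
best_params, best_recall = search.best_params_, search.best_score_
```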


### Model Performance

Better logging of metrics is definitely on the to-do list. Baseline recall started around 0.54 to 0.57, under-sampling brought it above 0.6, and the final grid search gave a recall of ~0.72.

With a test set of 1,225 tweets drawn from the under-sampled set, I am still predicting over 200 false negatives and 100 false positives. Clearly there is room to do better.
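
As a first step toward better metric logging, a small helper could collect the confusion-matrix counts alongside recall and precision. The function name `summarize_predictions` and the toy labels are hypothetical, and the numbers it produces are unrelated to the results reported above.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score


def summarize_predictions(y_true, y_pred) -> dict:
    """Return confusion-matrix counts plus recall and precision for class 1."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "true_negatives": int(tn),
        "false_positives": int(fp),
        "false_negatives": int(fn),
        "true_positives": int(tp),
        "recall": float(recall_score(y_true, y_pred)),        # tp / (tp + fn)
        "precision": float(precision_score(y_true, y_pred)),  # tp / (tp + fp)
    }


# Toy labels: 1 = questionable, 0 = fine.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
summary = summarize_predictions(y_true, y_pred)
```

Appending one such dictionary per experiment to a list or CSV would make runs easier to compare.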

### Pipeline (Attempt)

I attempted to create a pipeline but broke it while refactoring. I will fix it soon, but wanted to submit this report first.
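
For reference, a working pipeline might look something like the sketch below. This is not the broken pipeline from the notebook: the step names, the `CountVectorizer` settings, and the toy tweets with placeholder vocabulary are all assumptions.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Toy training data; "badword" is a placeholder for flagged vocabulary.
texts = [
    "what a lovely day",
    "badword badword you",
    "see you at lunch",
    "you are a badword",
]
labels = [0, 1, 0, 1]

# Chaining the steps keeps vectorizing and modeling in one fit/predict call.
pipe = Pipeline([
    ("vectorizer", CountVectorizer(ngram_range=(1, 1))),
    ("classifier", AdaBoostClassifier(n_estimators=50, random_state=42)),
])
pipe.fit(texts, labels)

predictions = pipe.predict(["lovely lunch today", "badword again"])
```

Because the vectorizer lives inside the pipeline, it is refit on each training fold during cross-validation, which avoids leaking vocabulary from held-out data.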

---

### Goals

- More cleaning: numbers, non-English text, etc.
- Examples of word lists to incorporate:
    - https://en.wikipedia.org/wiki/Category:Ethnic_and_religious_slurs
    - https://en.wikipedia.org/wiki/Category:Sexuality_and_gender-related_slurs
- Fix the pipeline
- Interaction terms with sentiment
- Dimensionality reduction
- Higher n-gram ranges with a hashing vectorizer
