# Predicting Tweet Sentiments

The goal of this competition was to build a text classification model using Natural Language Processing: in other words, to predict which tweets in the test set are POSITIVE, NEGATIVE or NEUTRAL. This is a supervised learning problem, as the labels ("sentiment_class") were provided in the train.csv dataset. The performance of the model was evaluated using the following metric:
100*f1_score(actual_values, predicted_values, average='weighted').
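
To make the metric concrete, here is a minimal pure-Python sketch of what a support-weighted F1 score computes: the F1 of each class, averaged with weights proportional to how often the class occurs in the actual labels. The helper name `weighted_f1` and the toy labels are my own illustration, not part of the competition code.

```python
from collections import Counter

def weighted_f1(actual, predicted):
    """Support-weighted mean of the per-class F1 scores (0 to 1 scale)."""
    support = Counter(actual)          # how many true samples each class has
    total = 0.0
    for label, n_true in support.items():
        tp = sum(1 for a, p in zip(actual, predicted) if a == label and p == label)
        fp = sum(1 for a, p in zip(actual, predicted) if a != label and p == label)
        fn = n_true - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n_true * f1           # weight each class by its support
    return total / len(actual)

# Toy labels, made up for illustration only
actual = [1, 0, -1, 1, 0, 1]
predicted = [1, 0, 0, 1, -1, 1]
score = 100 * weighted_f1(actual, predicted)
print(round(score, 2))
```

Because the weights follow the class frequencies, a model can score well on this metric while still doing poorly on a rare class.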

The approach I adopted was an ensemble model. I trained Random Forest, Decision Tree, Extra Trees and Gradient Boosting classifiers, with hyperparameters tuned on the train set using GridSearchCV. Finally, I combined them with a voting technique to produce the final_predictions for the test set. Most of the model was developed with scikit-learn.

Let's dive into the code. We were provided with two datasets, train.csv and test.csv. Let's load them and take a first look.

In [None]:
# importing libraries

import pandas as pd
import pandas_profiling
from pandas_profiling import ProfileReport
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
import re
from matplotlib import pyplot as plt
import seaborn as seab
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Import the train and test csv files
pd.set_option('display.max_colwidth', 100)   # display up to 100 characters in each column
train_data = pd.read_csv(r'D:\jobs\A! Hackathon\Mothers_Day\dataset\train.csv')
test_data = pd.read_csv(r'D:\jobs\A! Hackathon\Mothers_Day\dataset\test.csv')
train_data.head()

## Exploratory Data Analysis (EDA)

The first step for creating an ML model is exploring the given datasets (train.csv and test.csv). This is a crucial step.

To explore the data I generated a report with pandas_profiling, which builds profile reports from a pandas DataFrame. The pandas df.describe() function is great but a little basic for serious exploratory data analysis. pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.

For each column the following statistics - if relevant for the column type - are presented in an interactive HTML report:

- Type inference: detect the types of columns in a dataframe.
- Essentials: type, unique values, missing values
- Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
- Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
- Most frequent values
- Histogram
- Correlations: highlighting of highly correlated variables, Spearman, Pearson and Kendall matrices
- Missing values: matrix, count, heatmap and dendrogram of missing values
- Text analysis: learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data

In [None]:
pandas_profiling.ProfileReport(train_data)

In [None]:
# let's generate a profile for test data as well
pandas_profiling.ProfileReport(test_data)

Let's use the above profiles to make some inferences.

The different variables in the dataset are:
1. id: an index provided to every record in the datasets
2. original_text: The actual Mother's Day tweets
3. lang: The language of the tweets. In this dataset most of them are English, i.e., 'en'
4. retweet_count: The number of times the texts were retweeted
5. original_author: The person that posted the tweet (it is their username) 
6. sentiment_class: This is the 'target' variable. All the tweets were labelled as 1 (Positive), 0 (Neutral) or -1 (Negative). We are to create a Machine Learning model that uses Natural Language Processing to predict these labels on the test data.

In the field 'sentiment_class', how many Positive (1), Negative(-1) and Neutral(0) texts are there?

In [None]:
%matplotlib inline
seab.countplot(x='sentiment_class', data=train_data)
counts = train_data['sentiment_class'].value_counts()   # number of rows per label
print('Out of {} rows, {} are labelled as 1, {} are labelled as 0 and {} are labelled as -1'.format(
    len(train_data), counts.get(1, 0), counts.get(0, 0), counts.get(-1, 0)))

It looks like there is a huge imbalance in the class labels. Should we balance them using SMOTE, or import a pretrained model? A deep learning model needs loads and loads of data, so to handle the imbalance I instead used tree-based classifiers; some of them, such as RandomForestClassifier, accept a class_weight parameter that helps tackle this issue.
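
As a rough illustration of what class_weight='balanced' does, the sketch below reproduces the formula given in the scikit-learn documentation, n_samples / (n_classes * count_of_class), in plain Python. The helper name `balanced_class_weights` and the toy label counts are assumptions made for the example; they are not the real distribution of this dataset.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Toy, deliberately imbalanced labels (not the real class counts)
toy_labels = [0] * 6 + [1] * 3 + [-1] * 1
weights = balanced_class_weights(toy_labels)
print(weights)
```

The rarest class receives the largest weight, so mistakes on it cost more during training.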

Moreover, there seem to be only a few missing values (<0.1%), so I am going to let them be for now. The variables 'lang', 'retweet_count' and 'original_author' have high cardinality. Grouping them into similar categories, or any other engineering, would risk overfitting the model to this dataset. Hence, we are going to drop these three variables.

## Feature Engineering

Feature engineering is the most important part of creating a Machine Learning model. The more sensible the features, the more accurate our model will be. In this context, that is, tweet classification, I believe the best feature is clean text.

The feature I created is "clean_original_text": the tweets in the "original_text" field, cleaned as follows:

- URLs and usernames were replaced with placeholder tokens, the # was stripped from hashtags, and punctuation and stopwords were removed
- Repeated characters were meant to be collapsed, i.e., hellllooooooo changed to hello (the clean_text function in the next cell has no dedicated step for this)
- The tokenized words were stemmed

All of this is done by a function called "clean_text", which is called every time we want to clean the "original_text".
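
The repeated-character step mentioned in the list above could be sketched with a single regular expression. This is only an illustration: the helper name `collapse_repeats` is hypothetical, and the rule used here (shrink any run of three or more identical characters to two) is an assumption, since recovering the exact word "hello" would need a dictionary lookup on top of it.

```python
import re

def collapse_repeats(text):
    # Shrink any run of 3 or more identical characters down to 2,
    # so legitimate double letters ("happy", "hello") are left alone.
    return re.sub(r'(.)\1{2,}', r'\1\1', text)

print(collapse_repeats('hellllooooooo'))  # helloo
print(collapse_repeats('happy'))          # happy
```

After stemming and TF-IDF weighting, variants such as "helloo" and "hellooooo" would at least map to the same token.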

The engineered feature was then vectorized using a TF-IDF vectorizer.

The above procedure was adopted under the assumption that cleansed and vectorized data could facilitate a more accurate model.

In [None]:
# Feature Generation ---- clean_original_text
# Cleaning our data --- replace URLs and usernames with placeholders, strip '#',
# remove punctuation and stopwords, then stem the remaining tokens
# (needs the NLTK stopwords corpus and tokenizer data to be downloaded once)

stop_words = set(nltk.corpus.stopwords.words('english'))   # renamed so it does not shadow the imported stopwords module
ps = nltk.PorterStemmer()    # the stemmer used here is PorterStemmer

def clean_text(given_text):
    text_noURL = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', given_text)  # replace URLs with the token URL
    text_nouser = re.sub(r'@[^\s]+', 'AT_USER', text_noURL)                        # replace usernames with the token AT_USER
    text_nohash = re.sub(r'#([^\s]+)', r'\1', text_nouser)                         # remove the # in #hashtag
    text_token = word_tokenize(text_nohash)                                        # tokenize with NLTK (repeated characters are not collapsed here)
    text_nopunct = " ".join([word.lower() for word in text_token if word not in string.punctuation])    # lowercase and remove punctuation tokens
    text_tokenized = re.split(r'\W+', text_nopunct)                                # split on non-word characters
    text = [ps.stem(word) for word in text_tokenized if word and word not in stop_words]     # drop empty strings and stopwords, stem the rest
    return text

# Now let's vectorize using tfidf
tfidf_vect = TfidfVectorizer(analyzer = clean_text)
X_features = tfidf_vect.fit_transform(train_data['original_text']).toarray()
X_test = tfidf_vect.transform(test_data['original_text']).toarray()
Y = train_data['sentiment_class']
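
To show what the TF-IDF step produces, here is a small pure-Python sketch of the weighting on already-cleaned token lists. It assumes the scheme that, to my understanding, TfidfVectorizer uses with its default settings (smoothed idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each row). The function name `tfidf` and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns (vocabulary, rows of L2-normalised weights)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed inverse document frequency (assumed to match the library defaults)
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        raw = [tf[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0   # avoid dividing by zero for empty docs
        rows.append([w / norm for w in raw])
    return vocab, rows

# Toy "cleaned tweets", as clean_text might return them
toy_docs = [['happi', 'mother', 'day'], ['mother', 'day', 'gift'], ['love', 'mom']]
vocab, rows = tfidf(toy_docs)
print(vocab)
print([round(w, 3) for w in rows[0]])
```

Tokens that appear in fewer documents (here 'happi') get a larger weight than tokens shared across documents (here 'mother' and 'day').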

## Model Selection, Comparison and Evaluation

The various classifiers trained were:
1. Random Forest
2. Decision Trees
3. ExtraTrees Classifier
4. Gradient Boosting Trees

GridSearchCV was used to search for the best hyperparameters of each model within the grids shown below. These models were then compared, and finally an ensemble model was created.
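
A grid search tries every combination of the listed values and cross-validates each one, so the cost grows quickly with the grid size. The sketch below counts the work for the Random Forest grid used in this notebook with cv=5; the helper name `grid_candidates` is hypothetical and only mimics the enumeration, it is not how scikit-learn is implemented.

```python
from itertools import product

def grid_candidates(param_grid):
    """Enumerate every parameter combination in a grid (Cartesian product)."""
    names = sorted(param_grid)
    return [dict(zip(names, values))
            for values in product(*(param_grid[name] for name in names))]

# Same grid as the Random Forest search in this notebook
rf_grid = {'n_estimators': [10, 15, 20], 'max_depth': [205, 300, 305, 400]}
candidates = grid_candidates(rf_grid)
n_folds = 5
total_fits = len(candidates) * n_folds   # one fit per candidate per fold
print(len(candidates), total_fits)       # 12 60
```

With the default refit behaviour there is one additional fit of the best candidate on the full training set.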

In [None]:
# Random Forest Classifier
rf = RandomForestClassifier()
params = {'n_estimators' : [10, 15, 20], 
          'max_depth' : [205, 300, 305, 400]
         }
gs = GridSearchCV(rf, params, cv=5, scoring='f1_weighted', verbose = 1, n_jobs = -1)
gs.fit(X_features, Y)
print(gs.best_params_)
print(gs.best_score_)

In [None]:
# Decision Tree
dt = DecisionTreeClassifier()
params = {'max_depth' : [30, 50, 99], 
          'max_features' : [1000, 2000, 5000, 10000]
         }
gs = GridSearchCV(dt, params, cv=5, scoring='f1_weighted', verbose = 1, n_jobs = -1)
gs.fit(X_features, Y)
print(gs.best_params_)
print(gs.best_score_) 

In [None]:
# Extra trees
etc = ExtraTreesClassifier(random_state=10)
params = {'n_estimators' : [10, 50, 100, 500, 1000], 
          'max_features' : [5, 10, 50, 100]
         }
gs = GridSearchCV(etc, params, cv=5, scoring='f1_weighted', verbose = 1, n_jobs = -1)
gs.fit(X_features, Y)
print(gs.best_params_)
print(gs.best_score_)

In [None]:
# Gradient Boosted Trees
gb = GradientBoostingClassifier(random_state=0)
params = {'n_estimators' : [10, 50, 100], 
          'max_depth' : [5, 10, 30, 60]
         }
gs = GridSearchCV(gb, params, cv=5, scoring='f1_weighted', verbose = 1, n_jobs = -1)
gs.fit(X_features, Y)
print(gs.best_params_)
print(gs.best_score_)

## Ensemble Model

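Finally, an ensemble model was created with a voting technique, implemented using scikit-learn's VotingClassifier. Note that the call below uses voting='soft' with weights, i.e., it picks the class with the highest weighted average of predicted probabilities rather than taking a strict majority (max) vote.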

In [None]:
rf = RandomForestClassifier(n_estimators=10, max_depth=300, class_weight='balanced', n_jobs=-1)
dt = DecisionTreeClassifier(max_depth=99, max_features=400)
etc = ExtraTreesClassifier(n_estimators=100, max_features=2000, n_jobs=-1) 
gb = GradientBoostingClassifier(n_estimators=10, max_depth=17, learning_rate=0.1)

# Voting ensemble (soft voting: weighted average of predicted probabilities)
ensemble = VotingClassifier(estimators=[('rf', rf), ('dt', dt), ('etc', etc), ('gb', gb)], voting='soft', n_jobs=-1, 
                            weights=[2,2,1,1], flatten_transform=True)
ensemble.fit(X_features, Y)
final_predictions = ensemble.predict(X_test)
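
The VotingClassifier above is configured with voting='soft' and weights=[2,2,1,1]. The following pure-Python sketch, with made-up probabilities for a single tweet, illustrates how soft voting (weighted average of class probabilities) can disagree with hard, majority-style voting (weighted count of each model's top class). The helper names `soft_vote` and `hard_vote` are my own and only mirror the idea, not the library internals.

```python
def soft_vote(probas, weights, classes):
    """Pick the class with the highest weighted average probability."""
    total = sum(weights)
    averaged = [sum(w * p[i] for w, p in zip(weights, probas)) / total
                for i in range(len(classes))]
    best = max(range(len(classes)), key=lambda i: averaged[i])
    return classes[best], averaged

def hard_vote(probas, weights, classes):
    """Pick the class with the highest weighted count of per-model top picks."""
    tally = [0.0] * len(classes)
    for w, p in zip(weights, probas):
        tally[max(range(len(classes)), key=lambda i: p[i])] += w
    return classes[max(range(len(classes)), key=lambda i: tally[i])]

classes = [-1, 0, 1]
weights = [2, 2, 1, 1]           # same weights as the ensemble above
toy_probas = [                    # made-up probabilities for one tweet
    [0.05, 0.50, 0.45],           # rf: narrowly prefers 0
    [0.05, 0.50, 0.45],           # dt: narrowly prefers 0
    [0.00, 0.10, 0.90],           # etc: strongly prefers 1
    [0.00, 0.10, 0.90],           # gb: strongly prefers 1
]
soft_label, averaged = soft_vote(toy_probas, weights, classes)
hard_label = hard_vote(toy_probas, weights, classes)
print(soft_label, hard_label)     # 1 0
```

Soft voting lets confident models outweigh lukewarm ones, which is why the two schemes can return different labels for the same inputs.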

### Saving the predictions to a csv file

In [None]:
output = pd.DataFrame({'id': test_data.id, 'sentiment_class': final_predictions})
output.to_csv(r'D:\jobs\A! Hackathon\Mothers_Day\dataset\submission_rf5.csv', index=False)
print("Your submission was successfully saved!")