# Climate Change Belief Analysis
**Team 2 JHB July 2020**



# Introduction

### Background  

In a [research study](https://www.barrons.com/articles/two-thirds-of-north-americans-prefer-eco-friendly-brands-study-finds-51578661728), 19,000 consumers from 28 countries were polled to find out how individual shopping decisions are changing. Nearly 70% of consumers in the U.S. and Canada said that it is important for a company or brand to be sustainable or eco-friendly. More than a third (40%) of the respondents globally said that they are purpose-driven consumers, who select brands based on how well they align with their personal beliefs.

Many companies are built around lessening their environmental impact or carbon footprint. They offer products and services that are environmentally friendly and sustainable, in line with their values and ideals. They would like to determine how people perceive climate change and whether or not they believe it is a real threat. This would add to their market research efforts in gauging how their product/service may be received.  

The goal of this challenge is to build a classification machine learning model that determines whether a person believes in climate change, using tweet data. This model will provide insights into public opinion on climate change and consumer sentiment to companies looking to market their new or improved products or services to consumers, in response to CER.

As consumer demand for sustainable, eco-friendly products and services increases, a sentiment classification model that identifies these potential customers is key. It could be used by any business or organisation that is committed to carbon neutrality and wants to inform its marketing strategies. This includes, but is not limited to, companies in the retail, automotive, government, agriculture & food, and pharmaceutical spheres. The model could also be used by government sectors that want to identify the various belief sentiments in order to better direct environmental awareness and education campaigns, in alignment with their legislative directives and climate change response plans.


### Problem statement  

Build a machine learning model that is able to classify whether or not an individual believes in man-made climate change based on historical tweet data to increase insights about customers and inform future marketing strategies.

You can find the project overview [here](https://www.kaggle.com/c/climate-change-edsa2020-21).

# Notebook outline

1. Installations and Imports
2. Exploratory Data Analysis

**Base Model**
3. Data Preprocessing
4. Text Feature Extraction
5. Model Building
6. Model Evaluation
7. Model Analysis
8. Submission

**Tuned and Improved Model**
9. Data Preprocessing
10. Text Feature Extraction
11. Modelling
12. Model Performance
13. Hyperparameter Tuning of Best Models
14. Model Analysis
15. ROC Curves and AUC
16. Save Output
17. Conclusion
18. Comet
19. References

# 1. Installations and Imports

### 1.1 Installations

In [None]:
pip install comet_ml



### 1.2 Imports

In [None]:
from comet_ml import Experiment

In [None]:
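# Create an experiment with your API key.
# The key is read from an environment variable (COMET_API_KEY is an assumed
# name) so that the secret is not stored in the notebook itself.
import os

experiment = Experiment(
    api_key=os.environ["COMET_API_KEY"],
    project_name="climate-change-belief-analysis",
    workspace="bmqhamane",
)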

Import python libraries

In [None]:

# Loading Data
import pandas as pd
import numpy as np
import nltk
import string
import re
import time

# Data Preprocessing
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.utils import resample
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize, TreebankWordTokenizer

# Model Building
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

# Model Evaluation
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix,
                             f1_score, precision_score, recall_score)
#from scikitplot.metrics import plot_roc, plot_confusion_matrix

# Exploratory Data Analysis
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.pyplot import rcParams
from wordcloud import WordCloud, STOPWORDS
%matplotlib inline




In [None]:
# download the NLTK corpora used below
nltk.download('stopwords')
nltk.download('wordnet')
stop_words = set(stopwords.words('english'))
sns.set_style('whitegrid')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### 1.3 Import Data

In [None]:
from google.colab import files 
  
  
uploaded = files.upload()

We will load our data as a Pandas DataFrame

In [None]:
train = pd.read_csv('/content/train.csv')
test = pd.read_csv('/content/test.csv') 

# 2. Exploratory Data Analysis (EDA)

This section explores the data by analysing the different climate change sentiments that people express on Twitter.

**Techniques that we are going to use to analyse our data**

- Understanding the distribution of sentiments
- An analysis of the Tweets statistics
- Understanding the length of our tweets
- The main topics on climate change

In [None]:
# create a copy of the original data
ftrain = train.copy()
ftest = test.copy()

In [None]:
print('There are', len(ftrain), 'rows and',ftrain.shape[1], 'columns in the train set.')
print('There are', len(ftest), 'rows and',ftest.shape[1], 'columns in the test set.')

Checking for null values in the data

In [None]:
#test data
ftest.isnull().sum()

In [None]:
#train data
ftrain.isnull().sum()

## 2.1 The distribution of climate change sentiments 




The distribution of sentiments surrounding climate change on Twitter shows that people hold different views on the topic. Each sentiment class in the data corresponds to one of these views.

In [None]:
#extract the value counts per sentiment class
a = ftrain.sentiment.value_counts()
#calculate the percentage of each sentiment class
b = 100*ftrain.sentiment.value_counts()/len(ftrain.sentiment)
b = round(b,2)
data = pd.concat([a,b],axis =1,)
data.columns = ['Value Count', 'Percentage']
data

In [None]:
sns.countplot(x='sentiment',data=ftrain,palette='rainbow')

As seen in the bar graph, sentiment class 1 has the highest number of tweets in the train data, accounting for 8530 tweets (53.92%). The smallest class is class -1, which accounts for 1296 tweets (8.19%). The sentiment classes are imbalanced: they do not contain the same number of tweets, as shown in the dataframe that compares the value counts and percentages of each class.

The class imbalance of the training data affects how unseen data (the test data) is classified in the modelling phase. Because the model sees far more examples of class 1, it becomes better at recognising class 1 tweets and may classify most tweets into that class. This will be taken into consideration in the preprocessing and modelling sections of the notebook.
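To make the effect of the imbalance concrete, the sketch below looks at a "model" that always predicts the majority class. The label list is made up for illustration (only the rough shares of class 1 and class -1 echo the figures above; the other two shares are assumptions), so the numbers are not results from this dataset.

```python
from collections import Counter

# Hypothetical labels that mimic an imbalanced 4-class problem.
labels = [1] * 54 + [2] * 23 + [0] * 15 + [-1] * 8

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# A predictor that always answers with the majority class is correct
# exactly as often as that class occurs in the data.
baseline_accuracy = majority_count / len(labels)

# Per-class recall of the same constant predictor: 1.0 for the majority
# class and 0.0 for every other class.
recall = {c: (1.0 if c == majority_class else 0.0) for c in counts}
macro_recall = sum(recall.values()) / len(recall)

print(majority_class, baseline_accuracy, macro_recall)
```

The accuracy of this constant predictor looks acceptable while three of the four classes are never predicted, which is why class-averaged metrics are worth checking alongside accuracy on imbalanced data.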





## 2.2 An overview of tweets statistics

In [None]:
#brief description of the train data
ftrain.message.describe()

In [None]:
#brief description of the test data
ftest.message.describe()

In [None]:
#description of the data per sentiment class
ftrain[['sentiment','message']].groupby('sentiment').describe()

Adding a column of the tweets length/character count to the data

In [None]:
ftrain['length'] = ftrain['message'].apply(len)
ftrain.head()

In [None]:
# creating a length column
ftest['length'] = ftest['message'].apply(len)
ftest.head()

## 2.3 The distribution of the tweets length in the data

In [None]:
sns.distplot(ftrain['length'],bins=30,kde=False,color='#440154')

In [None]:
ftrain['length'].describe()

In [None]:
#print the longest tweet in the train data
ftrain[ftrain['length'] == 208]['message'].iloc[0]

The tweets in the train data are between 14 and 208 characters long, with an average length of 123 characters. The longest tweet, at 208 characters, stands out from this average. The cell above shows that it is made up of only a few actual words. This will be taken into consideration in the preprocessing section of the notebook to ensure that noise in the tweets is removed.

In [None]:
sns.distplot(ftest['length'],bins=30,kde=False,color='#20A387')

In [None]:
ftest['length'].describe()

In [None]:
ftest[ftest['length'] == 623]['message'].iloc[0]

The tweets in the test data are between 7 and 623 characters long, with an average length of 123 characters. The longest tweet looks like a discrepancy, because Twitter limits tweets to 280 characters and this tweet exceeds that limit. It is made up of only a few actual words. This will be taken into consideration in the preprocessing section of the notebook to ensure that noise in the tweets is removed.

### The length of tweets per sentiment class

In [None]:
g = sns.FacetGrid(ftrain,col='sentiment')
g.map(plt.hist,'length')

Tweets in sentiment class 1 have the highest length frequencies compared with the other classes, which is consistent with class 1 containing the most tweets.


## 2.4 The main topics surrounding the climate change tweets

An understanding of the main topics discussed in the climate change conversation on Twitter is essential, as it illustrates the sentiments attached to climate change. This is done by extracting the most frequently used words and hashtags.

### 2.4.1 Top 30 used words in the tweets

 Train data

In [None]:
# convert the text to token counts
cv = CountVectorizer(stop_words = 'english')
words = cv.fit_transform(ftrain.message)

sum_words = words.sum(axis=0)
# create a list of (word, frequency) pairs, most frequent first
words_freq = [(word, sum_words[0, i]) for word, i in cv.vocabulary_.items()]
words_freq = sorted(words_freq, key = lambda x: x[1], reverse = True)
# create a dataframe of the words and their frequencies
frequency = pd.DataFrame(words_freq, columns=['word', 'freq'])

frequency.head(30).plot(x='word', y='freq', kind='bar', figsize=(15, 7), color = '#440154')
plt.title("Train : Most Frequently Occurring Words - Top 30",size=15)

Test data

In [None]:
cv = CountVectorizer(stop_words = 'english')
words = cv.fit_transform(ftest.message)

sum_words = words.sum(axis=0)

words_freq = [(word, sum_words[0, i]) for word, i in cv.vocabulary_.items()]
words_freq = sorted(words_freq, key = lambda x: x[1], reverse = True)

frequency = pd.DataFrame(words_freq, columns=['word', 'freq'])

frequency.head(30).plot(x='word', y='freq', kind='bar', figsize=(15, 7), color = '#20A387')
plt.title("Test : Most Frequently Occurring Words - Top 30", size =15)

In [None]:
#creating a word cloud from the data
wordcloud = WordCloud(background_color = 'white', 
                      width = 1000, height = 1000).generate_from_frequencies(dict(words_freq))

plt.figure(figsize=(8,8))
plt.title("WordCloud - Vocabulary from tweets")
plt.imshow(wordcloud)

### 2.4.2 The top 10 influential Twitter accounts per Sentiment Class

The accounts that received the most mentions are Twitter accounts that have engaged with the climate change topic. Twitter users mention these accounts when reposting (retweeting) an account's sentiment on climate change or when responding to an account's comment on climate change. Within the data, these Twitter accounts have played a vital role in fuelling the climate change debate on Twitter.
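As a minimal sketch of how mentions can be pulled out of tweet text and counted, the example below uses a few made-up tweets and a hypothetical helper, `extract_mentions`, built on the same kind of look-behind pattern (`(?<=@)\w+`) that the notebook applies.

```python
import re
from collections import Counter

# Made-up tweets, for illustration only.
tweets = [
    "RT @alice: climate change is real",
    "@bob have you seen this? cc @alice",
    "no mentions here",
]

def extract_mentions(text):
    """Return the account names that directly follow an '@' in a tweet."""
    return re.findall(r"(?<=@)\w+", text)

# Count every handle individually across all tweets.
mention_counts = Counter(name for t in tweets for name in extract_mentions(t))
print(mention_counts.most_common(2))
```

Counting handles one by one is a design choice of this sketch. Joining all mentions of a tweet into a single string before counting would instead count combinations of accounts that are mentioned together.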

In [None]:
def mentions(text):
    """
    The function extracts all the 
    mentions from the message columns
    """
    line=re.findall(r'(?<=@)\w+',text)
    return " ".join(line)

In [None]:
#creating a mentions column
ftrain['mentions']=ftrain['message'].apply(lambda x:mentions(x))

train_neg = ftrain.loc[ftrain['sentiment'] == -1]
train0 = ftrain.loc[ftrain['sentiment'] == 0]
train1 = ftrain.loc[ftrain['sentiment'] == 1]
train2 = ftrain.loc[ftrain['sentiment'] == 2]

In [None]:
#counting the mentions in the data
temp_neg= train_neg['mentions'].value_counts()[:][1:11]

temp_neg =temp_neg.to_frame().reset_index().rename(columns={'index':'Mentions','mentions':'count'})

plt.figure(figsize=(16,5))
x= temp_neg['Mentions']
y= temp_neg['count']

plt.title('Sentiment Class -1',size =15)
sns.barplot(x=y,y=x,color='#ff7f00')

In [None]:
#counting the mentions in the data
temp0= train0['mentions'].value_counts()[:][1:11]
temp0 =temp0.to_frame().reset_index().rename(columns={'index':'Mentions','mentions':'count'})
plt.figure(figsize=(16,5))

x= temp0['Mentions']
y= temp0['count']

plt.title('Sentiment Class 0',size =15)
sns.barplot(x=y,y=x,color='#fb9a99')

In [None]:
#counting the mentions in the data
temp1= train1['mentions'].value_counts()[:][1:11]
temp1 =temp1.to_frame().reset_index().rename(columns={'index':'Mentions','mentions':'count'})
plt.figure(figsize=(16,5))

x= temp1['Mentions']
y= temp1['count']
plt.title('Sentiment Class 1',size =15)
sns.barplot(x=y,y=x,color='#33a02c')

In [None]:
#counting the mentions in the data
temp2= train2['mentions'].value_counts()[:][1:11]
temp2 =temp2.to_frame().reset_index().rename(columns={'index':'Mentions','mentions':'count'})
plt.figure(figsize=(16,5))

x= temp2['Mentions']
y= temp2['count']
plt.title('Sentiment Class 2',size =15)
sns.barplot(x=y,y=x,color='#b2df8a')

### 2.4.3 An analysis of the Hashtags used  per sentiment class

A hashtag is written using the '#' symbol. Its main function is to categorise tweets based on a keyword or a topic associated with the hashtag. According to the Twitter Help Center, people use the hashtag symbol before a relevant phrase or keyword.

The hashtags used in the climate change tweets highlight people's interest in the climate change topic. They also show that people have divided opinions on climate change, which is reflected in the hashtags used within each sentiment class.
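The extraction and counting of hashtags can be sketched as follows. The tweets are invented for the example and `extract_hashtags` is a hypothetical helper name; the pattern `#(\w+)` captures the text after each '#'.

```python
import re
from collections import Counter

# Made-up tweets, for illustration only.
tweets = [
    "Watching #BeforeTheFlood tonight #climate",
    "#climate policy matters",
    "nothing tagged",
]

def extract_hashtags(text):
    """Return the hashtags in a tweet, without the leading '#'."""
    return re.findall(r"#(\w+)", text)

# Flatten the per-tweet lists into one list, then count.
all_tags = [tag for t in tweets for tag in extract_hashtags(t)]
top_tags = Counter(all_tags).most_common(10)
print(top_tags)
```

Note that this sketch is case-sensitive, so `#Climate` and `#climate` would be counted separately unless the text is lowercased first.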

In [None]:
# collecting the hashtags

def hashtag_extract(x):
    """
    Extract the hashtags from each message in x
    and return them as a list of lists (one list per message).
    """
    hashtags = []    
    for i in x:
        ht = re.findall(r"#(\w+)", i)
        hashtags.append(ht)
    return hashtags

In [None]:
# extracting hashtags from train tweets
HT_train_neg = hashtag_extract(ftrain['message'][ftrain['sentiment'] == -1])
HT_train0 = hashtag_extract(ftrain['message'][ftrain['sentiment'] == 0])
HT_train1 = hashtag_extract(ftrain['message'][ftrain['sentiment'] == 1])
HT_train2 = hashtag_extract(ftrain['message'][ftrain['sentiment'] == 2])


# unnesting list
HT_train_neg = sum(HT_train_neg,[])
HT_train0 = sum(HT_train0,[])
HT_train1 = sum(HT_train1,[])
HT_train2 = sum(HT_train2,[])

#### 2.4.3.1 Top 10 hashtags used in Sentiment class  -1 tweets

In [None]:
#creating a frequency distribution of the hashtags
a = nltk.FreqDist(HT_train_neg)
d = pd.DataFrame({'Hashtag': list(a.keys()),
                  'Count': list(a.values())})

# selecting top 10 most frequent hashtags     
d = d.nlargest(columns="Count", n = 10) 
plt.figure(figsize=(16,5))
ax = sns.barplot(data=d, x= "Hashtag", y = "Count", color ='#ff7f00')
ax.set(ylabel = 'Count')
plt.show()

In [None]:
#An example of a sentiment found within class -1 tweets
ftrain[ftrain['sentiment'] == -1]['message'].iloc[67]

In class -1, the most used hashtag is #MAGA, followed by #climate. These keywords were the most used when people discussed their sentiments concerning climate change. Other interesting hashtags in the top ten for class -1 are #fakenews and #ClimateScam, which suggest that some of the people tweeting about climate change believe that it is simply fake news or a scam. The third most used hashtag is #Trump. The class tends to discuss climate change in connection with politics, as reflected in #MAGA being the most used hashtag and in the example tweet shown in the cell above.

#### 2.4.3.2 Top 10 hashtags used in Sentiment class 0 tweets

In [None]:
#creating a frequency distribution of the hashtags
a = nltk.FreqDist(HT_train0)
c = pd.DataFrame({'Hashtag': list(a.keys()),
                  'Count': list(a.values())})

# selecting top 10 most frequent hashtags 
c = c.nlargest(columns="Count", n = 10) 
plt.figure(figsize=(16,5))
ax = sns.barplot(data=c, x= "Hashtag", y = "Count",color ='#fb9a99')
ax.set(ylabel = 'Count')
plt.show()

In [None]:
#An example of a sentiment found within class 0 tweets
ftrain[ftrain['sentiment'] == 0]['message'].iloc[184]

In [None]:
#An example of a sentiment found within class 0 tweets
ftrain[ftrain['sentiment'] == 0]['message'].iloc[197]

The most used hashtag in class 0 is #climate, followed by #climatechange. #Trump is a prominent hashtag in class 0 as well; Donald Trump's views on climate change are discussed in this class. Two other interesting hashtags are #BeforeTheFlood, a movie that depicts the impacts of climate change on the Earth, and #amreading, which people use to mention a book or article they are currently reading. The sentiments within class 0 are open conversations surrounding climate change, including people asking questions about climate change as well as sarcasm.

#### 2.4.3.3 Top 10 hashtags used in Sentiment class 1 tweets

In [None]:
#creating a frequency distribution of the hashtags
a = nltk.FreqDist(HT_train1)
d = pd.DataFrame({'Hashtag': list(a.keys()),
                  'Count': list(a.values())})

# selecting top 10 most frequent hashtags     
d = d.nlargest(columns="Count", n = 10) 
plt.figure(figsize=(16,5))
ax = sns.barplot(data=d, x= "Hashtag", y = "Count",color ='#33a02c')
ax.set(ylabel = 'Count')
plt.show()

In [None]:
#An example of a sentiment found within class 1 tweets
ftrain[ftrain['sentiment'] == 1]['message'].iloc[89]

The opinions in class 1 shift towards the view that climate change does exist. The conversations in this class discuss a movie called Before the Flood, which highlights the impact of climate change on the Earth. The hashtag #ActOnClimate is also used; the tweets associated with this hashtag on Twitter mainly discuss ways to combat climate change (http://www.tweepy.net/hashtag/ActOnClimate).

#### 2.4.3.4 Top 10 hashtags used in Sentiment class 2 tweets

In [None]:
#creating a frequency distribution of the hashtags
a = nltk.FreqDist(HT_train2)
d = pd.DataFrame({'Hashtag': list(a.keys()),
                  'Count': list(a.values())})

# selecting top 10 most frequent hashtags     
d = d.nlargest(columns="Count", n = 10) 
plt.figure(figsize=(16,5))
ax = sns.barplot(data=d, x= "Hashtag", y = "Count",color='#b2df8a')
ax.set(ylabel = 'Count')
plt.show()

In [None]:
#An example of a sentiment found within class 2 tweets
ftrain[ftrain['sentiment'] == 2]['message'].iloc[1]

The opinions in class 2 mainly focus on the climate. This is evident in the high count of the hashtag #climate; the second highest is #environment. The class is mainly focused on informing people about climate change and its effect on the environment.

## 2.5 The key findings from the EDA

* There are polarised views on climate change on Twitter.

* There is a class imbalance within the data; this will be considered in the preprocessing and model training sections.

* An analysis of the hashtags has shown that the tweets in class 1 believe in climate change, class 2 believe in and inform people about climate change, class 0 are more neutral and tend to downplay the existence of climate change, and class -1 do not believe that climate change exists.



In [None]:
# class 1 the PRO class
# class 2 the NEWS class
# class 0 the NEUTRAL class
# class -1 the ANTI class

# The Base Model 

In this section, we cover the process of building a base model, from the preprocessing of the data through to model building and evaluation.

# 3. Data Preprocessing

Preprocessing involves the elimination of trivial or less informative data, which does not contribute to the sentiment classification. To understand this process, it is important to understand what matters in sentiment analysis. Words are the most important part. Punctuation, on the other hand, does not carry the sentiment and is therefore treated as irrelevant. In addition, tweet elements such as images, videos, URLs, usernames and emojis are treated as not contributing to the polarity of the tweet (whether positive or negative). However, this only holds for machine learning models of the kind used in this notebook.

**Techniques that we are going to use to clean our data**

- Removing Noise
- Stop Words
- Tokenisation
- Lemmatisation (Normalisation)


## 3.1 Dealing with Class Imbalance - Resampling

The EDA highlighted that there is a class imbalance within the data. When training a classification model, it is preferable for all classes to have a relatively even split of observations. However, in the wild, classification datasets often come with unevenly distributed observations, with one class or set of classes having far more observations than the others. This can negatively affect the performance of the model. Therefore, resampling is necessary before training a model with this data.

Resampling methods modify the dataset in order to reduce the discrepancy among the sizes of the classes. Two approaches are common: undersampling, which eliminates instances from the majority class, and oversampling, which generates instances for the minority class. Both have their pros and cons. Both involve introducing a bias to select more samples from one class than from another, to compensate for an imbalance that is either already present in the data or likely to develop if a purely random sample were taken. Pykes mentioned that "the random oversampling may increase the likelihood of overfitting occurring since it makes exact copies of the minority class examples. In this way, a symbolic classifier, for instance, might construct rules that are apparently accurate, but actually cover one replicated example" and "In random under-sampling (potentially), vast quantities of data are discarded. This can be highly problematic, as the loss of such data can make the decision boundary between the minority and majority instances harder to learn, resulting in a loss in classification performance."

In [None]:
from IPython.display import Image
Image('resampling.png', width="800" ,height="400")

#### Combining Both Random Sampling Techniques

Combining both random sampling methods can occasionally result in better overall performance than either method applied in isolation. In this project we balance our data by using both oversampling and undersampling. The target class size is the average number of observations per class. If a class is smaller than the average it is upsampled, and if it is larger than the average it is downsampled.
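The idea of resampling every class to the average class size can be sketched with the standard library alone. The function name `resample_to_mean` and the toy rows are assumptions made for this example; here `random.sample` and `random.choices` stand in for library resampling utilities.

```python
import random
from collections import Counter, defaultdict

def resample_to_mean(rows, label_of, seed=42):
    """Resample every class to the mean class size.

    Classes at or above the mean are downsampled without replacement;
    classes below the mean are upsampled with replacement.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    class_size = len(rows) // len(by_class)
    balanced = []
    for members in by_class.values():
        if len(members) >= class_size:
            balanced.extend(rng.sample(members, class_size))     # downsample
        else:
            balanced.extend(rng.choices(members, k=class_size))  # upsample
    return balanced

# Hypothetical imbalanced data: (text, label) pairs.
sizes = {1: 54, 2: 23, 0: 15, -1: 8}
rows = [(f"tweet {label}-{i}", label) for label, n in sizes.items() for i in range(n)]

balanced = resample_to_mean(rows, label_of=lambda row: row[1])
print(Counter(label for _, label in balanced))
```

Downsampling without replacement keeps the retained majority-class rows unique, whereas upsampling necessarily duplicates minority-class rows. Those duplicates can end up on both sides of a later train/validation split, so validation scores on resampled data may be optimistic.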

In [None]:
def resambling(df):
    """
        Take in a dataframe and resample the classes based on a common class size.
        The class size is the average number of observations per class.
        Classes with more observations than the class size are downsampled and
        classes with fewer observations than the class size are upsampled.
    """
    df = df.copy()
    class_2 = df[df['sentiment'] == 2]  
    class_1 = df[df['sentiment'] == 1]  
    class_0 = df[df['sentiment'] == 0]  
    class_n1 = df[df['sentiment'] == -1] 
    class_size = int((len(class_1)+len(class_2)+len(class_0)+len(class_n1))/4)
    # downsampling class 1, the PRO class (without replacement, so that no
    # tweet is duplicated while others are discarded)
    rclass_1 = resample(class_1, replace=False, n_samples=class_size, random_state=42)
    # upsampling class 2, the NEWS class
    rclass_2 = resample(class_2, replace=True, n_samples=class_size, random_state=42)
    # upsampling class 0, the NEUTRAL class
    rclass_0 = resample(class_0, replace=True, n_samples=class_size, random_state=42)
    # upsampling class -1, the ANTI class
    rclass_n1 = resample(class_n1, replace=True, n_samples=class_size, random_state=42)
    sampled_df = pd.concat([rclass_2, rclass_1, rclass_0, rclass_n1])
    
    return sampled_df

In [None]:
Resampled_train_df = resambling(train)

In [None]:
news=Resampled_train_df[Resampled_train_df.sentiment == 2].shape[0]
pro =Resampled_train_df[Resampled_train_df.sentiment == 1].shape[0]
neutral=Resampled_train_df[Resampled_train_df.sentiment == 0].shape[0]
anti =Resampled_train_df[Resampled_train_df.sentiment == -1].shape[0]
#visualising
plt.figure(1,figsize=(14,8))
plt.bar(["News", "Pro", "Neutral" , "Anti"],[news, pro, neutral , anti])
plt.xlabel('Tweet_class')
plt.ylabel('Sentiment counts')
plt.title('Class Distributions')
plt.show()

## 3.2 Text Cleaning

Before we begin with data cleaning, we create copies of the dataframes, which allows us to make changes without altering the original dataframes.

In [None]:
# Creating copies of dataframes
train_copy = Resampled_train_df.copy()
test_copy = ftest.copy()

### 3.2.1 Removing Noise

In text analysis, removing noise is an essential step in getting the data into a usable format.

We will remove noise with the following steps.
- Convert letters to lowercase
- Remove URL links 
- Remove mentions, hashtags and numbers
- Remove punctuation

In [None]:
def cleaner(tweet):
    """
    Clean a single tweet (a string) and return the cleaned string:
    - convert letters to lowercase
    - remove URL links
    - remove account mentions
    - remove hashtags
    - remove numbers
    - remove punctuation and excess white-space
    """
    tweet = tweet.lower()
    to_del = [
        r"https?://\S+",  # strip URLs (up to the next white-space)
        r"@\w*",  # strip account mentions
        r"#\w*",  # strip hashtags
        r"\d+",  # delete numeric values
        "\ufffd",  # remove the "character not present" diamond (U+FFFD)
    ]
    for key in to_del:
        tweet = re.sub(key, "", tweet)
    
    # strip punctuation and special characters
    tweet = re.sub(r"[,.;':@#?!\&/$]+\ *", " ", tweet)
    # strip excess white-space
    tweet = re.sub(r"\s\s+", " ", tweet)
    
    return tweet.lstrip(" ")

In [None]:
train_copy['message'] = train_copy['message'].apply(cleaner)

In [None]:
train_copy.tail(5)

### 3.2.2 Removing Stop Words

Stop words are the most common words in a language, such as "if", "but", "we", "he", "she" and "they". We can usually remove these words without changing the semantics of a text, and doing so often (but not always) improves the performance of a model. Removing stop words becomes much more useful when we use longer sequences of words as model features.
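A minimal sketch of stop word removal is shown below. It uses a tiny hand-picked stop list rather than NLTK's English list, and `remove_stop_words` is a hypothetical helper name. The second call illustrates why removal does not always help: when a negation such as "not" is on the stop list, dropping it changes what the text says.

```python
# A tiny, hand-picked stop list used only for this illustration.
stop_words = {"if", "but", "we", "he", "she", "they", "is", "the", "a", "not"}

def remove_stop_words(text):
    """Drop every white-space separated word that is in the stop list."""
    return " ".join(word for word in text.split() if word not in stop_words)

print(remove_stop_words("we think the climate is changing"))
print(remove_stop_words("this is not a hoax"))
```

Storing the stop words in a `set` keeps each membership test fast, which matters when the filter runs over many thousands of tweets.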

In [None]:
# use a set so that membership tests are fast
stop_word = set(stopwords.words('english'))
train_copy['message'] = train_copy['message'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_word]))

In [None]:
train_copy.head(5)

### 3.2.3 Tokenisation

Tokenisation is the process of splitting raw text into smaller units called tokens, such as words or sentences. A token is a piece of a whole: a word is a token in a sentence, and a sentence is a token in a paragraph. Tokens are the units from which model features are built, and analysing their sequence helps in interpreting the meaning of the text. For example, the text "It is raining" can be tokenised into 'It', 'is', 'raining'.
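As a rough illustration of word-level tokenisation, the sketch below uses a single regular expression. The function `simple_tokenize` is a hypothetical stand-in for this example and not the tokenizer used in the notebook; rule-based tokenizers such as NLTK's Treebank tokenizer make different choices, for instance around contractions.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks.

    A word may contain one internal apostrophe (e.g. "Don't").
    """
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

print(simple_tokenize("It is raining"))
print(simple_tokenize("Don't panic!"))
```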

In [None]:
tokeniser = TreebankWordTokenizer()
train_copy['tokens'] = train_copy['message'].apply(tokeniser.tokenize)

In [None]:
train_copy.head(5)

### 3.2.4 Lemmatisation

Lemmatisation reduces each word to its dictionary base form (its lemma), so that inflected forms such as "storms" and "storm" are treated as the same word. Rather than simply stripping affixes, it relies on a vocabulary in which words are linked by their meanings. In particular, we utilise WordNet.

In [None]:
def lemmas(words, lemmatizer):
    """Return the lemma of every token in the list of words."""
    return [lemmatizer.lemmatize(word) for word in words]

In [None]:
lemmatizer = WordNetLemmatizer()
train_copy['lemma'] = train_copy['tokens'].apply(lemmas, args=(lemmatizer, ))

In [None]:
train_copy.head(5)

# 4. Text Feature Extraction

## 4.1 Splitting out the X variable from the target

In [None]:
y = train_copy['sentiment']
X = train_copy['message']

## 4.2 Data transformation with TfidfVectorizer

TF-IDF will be used to transform our data. It assigns each word a score based on how often it occurs in a tweet, weighted down by how many tweets contain it, so that the scores highlight words of greater interest. The TfidfVectorizer tokenizes the documents, learns the vocabulary and the inverse document frequency weightings, and allows you to encode new documents.
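The weighting can be written out by hand to show what the scores mean. The sketch below is a plain-Python illustration on three made-up documents; it assumes the smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation, which is intended to mirror scikit-learn's default TfidfVectorizer settings. It is not the library's implementation.

```python
import math
from collections import Counter

# Made-up documents, for illustration only.
docs = [
    "climate change is real",
    "climate policy debate",
    "real policy change",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(docs)

# Document frequency: in how many documents does each word occur?
df = Counter(w for doc in tokenized for w in set(doc))

# Smoothed inverse document frequency: rarer words get larger weights.
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_vector(tokens):
    """Return the L2-normalised TF-IDF vector of one tokenised document."""
    tf = Counter(tokens)
    raw = [tf[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0
    return [v / norm for v in raw]

vectors = [tfidf_vector(doc) for doc in tokenized]
print(vocab)
print([round(v, 3) for v in vectors[0]])
```

In the first document, "is" occurs in only one document and therefore receives a higher weight than "climate", which occurs in two.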

In [None]:
vectorizer = TfidfVectorizer()
X_vectorized = vectorizer.fit_transform(X)

# 5. Model Building

## 5.1 Splitting the training data into a training and validation set

The training data set is split into training and validation dataset. A validation dataset is a sample of data held back from training the model and is used to give an estimate of model skill while tuning the model’s hyperparameters. The validation dataset is different from the test dataset that is also held back from the training of the model but is instead used to give an unbiased estimate of the skill of the final tuned model when comparing or selecting between final models.
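A stratified split keeps the class proportions the same in the training and validation sets. The sketch below shows the idea in plain Python on made-up labels; `stratified_split` is a hypothetical helper written for this example, not the scikit-learn function.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, val_fraction=0.3, seed=11):
    """Return (train_indices, val_indices) with per-class proportions kept."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])     # held back for validation
        train_idx.extend(indices[n_val:])   # used for fitting
    return sorted(train_idx), sorted(val_idx)

# Hypothetical balanced labels: 20 observations per class.
labels = [-1] * 20 + [0] * 20 + [1] * 20 + [2] * 20
train_idx, val_idx = stratified_split(labels)

print(Counter(labels[i] for i in val_idx))
```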

In [None]:
X_train,X_val,y_train,y_val = train_test_split(X_vectorized,y,test_size=.3,shuffle=True, stratify=y, random_state=11)

## 5.2 Model Fitting

### 5.2.1 Random Forest Classifier

Random forest is a supervised learning algorithm. It builds decision trees on randomly selected data samples, gets a prediction from each tree and selects the final answer by means of voting. It also provides a good indicator of feature importance. In general, the more trees a forest has, the more robust it is.
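The two ingredients described above, random samples of the data and a vote over the trees, can be sketched as follows. The per-tree predictions are made up, and the helper names are assumptions for this example. Library implementations may combine trees differently; scikit-learn's random forest, for instance, averages the trees' class probabilities rather than counting hard votes.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as each tree's training set."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)
sample = bootstrap_sample(list(range(10)), rng)

# Hypothetical predictions of five trees for one tweet.
tree_predictions = [1, 1, 2, 1, 0]
forest_prediction = majority_vote(tree_predictions)
print(sample, forest_prediction)
```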

In [None]:
from IPython.display import Image
Image('rf.png', width="800" ,height="400")

In [None]:
rfc = RandomForestClassifier(n_estimators=100,random_state=42)
rfc.fit(X_train, y_train)

### 5.2.2 Logistic Classifier

Logistic regression is a supervised machine learning classification algorithm that predicts the probability of a categorical dependent variable. In its basic form, the dependent variable is binary, coded as 1 (yes, success, etc.) or 0 (no, failure, etc.), and the model predicts P(Y=1) as a function of X. Logistic regression is well suited to binary problems, but in our case there are more than two classes. A one-vs-rest (OvR) approach will therefore be used to combine several logistic regression models: a separate binary model is trained for each label that the response variable takes on.
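The one-vs-rest decision rule can be illustrated with a few lines of plain Python. The weights and biases below are invented for the example (they are not fitted to any data), and `ovr_predict` is a hypothetical helper: each class has its own binary logistic model, and the class whose model reports the highest probability wins.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical (weights, bias) of one binary model per class, two features.
models = {
    -1: ([-2.0, 0.5], 0.0),
    0: ([0.1, -0.3], -0.5),
    1: ([1.5, 1.0], 0.2),
    2: ([0.3, 2.0], -1.0),
}

def ovr_predict(x):
    """Score x with every 'class vs rest' model and pick the best class."""
    scores = {
        label: sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        for label, (weights, bias) in models.items()
    }
    return max(scores, key=scores.get), scores

label, scores = ovr_predict([1.0, 0.2])
print(label, {k: round(v, 3) for k, v in scores.items()})
```

Because each model is trained independently, the four probabilities do not need to sum to one; only their ranking is used for the prediction.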

In [None]:
from IPython.display import Image
Image('logistic.jpg', width="800" ,height="300")

In [None]:
lmc = LogisticRegression(multi_class='ovr')
lmc.fit(X_train, y_train)

### 5.2.3 Decision Tree Classifier

The decision tree is a supervised classification model in the form of a tree structure. It breaks a dataset down into smaller and smaller subsets while the associated tree is built incrementally. The final result is a tree with decision nodes and leaf nodes.
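One way to picture how a decision node is chosen is to search for the threshold that leaves the purest subsets. The sketch below uses Gini impurity, which is the default criterion of scikit-learn's `DecisionTreeClassifier`, on a single hypothetical feature. The helper names are illustrative only.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure node, larger for mixed nodes."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Try each midpoint between sorted values; return (threshold, weighted impurity)."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no split possible between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for val, lab in pairs if val <= threshold]
        right = [lab for val, lab in pairs if val > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical feature values (e.g. one TF-IDF weight) with class labels.
feature = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
classes = ["Pro", "Pro", "Pro", "Anti", "Anti", "Anti"]
threshold, impurity = best_threshold(feature, classes)
```

A real tree repeats this search over every feature and then recurses into each subset until a stopping rule is met.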

In [None]:
from IPython.display import Image
Image('decision_tree.png', width="800" ,height="300")

In [None]:
dtc = DecisionTreeClassifier(random_state=42)

In [None]:
dtc.fit(X_train, y_train)

### 5.2.4 Support vector machine Classifier

Support Vector Machine (SVM) is a supervised machine learning algorithm. A linear SVM works by drawing a hyperplane (a straight line in two dimensions) between two classes. Data points on one side of the hyperplane are labelled as one class and points on the other side as the second. Many hyperplanes could separate the classes, so the objective is to find the one with the maximum margin, i.e. the greatest distance to the nearest data points of both classes.
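The geometry can be illustrated with a small sketch: the sign of `w.x + b` decides the class, and dividing its absolute value by the length of `w` gives the distance to the hyperplane. The hyperplane and points below are hypothetical; a fitted SVM would choose `w` and `b` to maximise the smallest such distance.

```python
import math

def decision_value(x, w, b):
    """Signed score w.x + b; its sign gives the predicted side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def distance_to_hyperplane(x, w, b):
    """Geometric distance from x to the hyperplane w.x + b = 0."""
    return abs(decision_value(x, w, b)) / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane and points in two dimensions.
w, b = [3.0, 4.0], -5.0
points = {"a": [3.0, 4.0], "b": [0.0, 0.0]}
sides = {name: 1 if decision_value(p, w, b) >= 0 else -1 for name, p in points.items()}
distances = {name: distance_to_hyperplane(p, w, b) for name, p in points.items()}
```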

In [None]:
from IPython.display import Image
Image('svm.png', width="600" ,height="400")

In [None]:
svc = SVC(kernel='linear')
svc.fit(X_train, y_train)

# 6. Model Evaluation

The base models are first evaluated on the validation set that was held back from the training data. Afterwards, the test dataset is used to make predictions. Comparing the two helps us to judge whether the models are overfitting.

## 6.1 Model evaluation using validation data

The training data was split into a training set and a validation set. The validation set is used to evaluate the models before they are evaluated on the test dataset.

In [None]:
# Random forest Predict
rfc_pred = rfc.predict(X_val)
# Multi-class Logistic Predict
lmc_pred = lmc.predict(X_val)
#Decision Tree Predict
dtc_pred = dtc.predict(X_val)
# Support vector Machine Predict
svc_pred = svc.predict(X_val)


## 6.2 Model evaluation using test data

### 6.2.1 Data transformation with Vectorizer

In [None]:
testx = test_copy['message']
test_vect = vectorizer.transform(testx)

### 6.2.2 Making predictions on the test set

In [None]:
# Random Forest
rfc_pred_t = rfc.predict(test_vect)
# Multi-class Logistic Predict
lmc_pred_t = lmc.predict(test_vect)
#Decision Tree Predict
dtc_pred_t = dtc.predict(test_vect)
# Support vector Machine Predict
svc_pred_t = svc.predict(test_vect)


# 7. Model Analysis

The performance of a classification model is based on the counts of test records correctly and incorrectly predicted by the model.

## 7.1 Classification Report

The classification report gives more information on where a model goes wrong by looking specifically at the effect of Type I errors (false positives) and Type II errors (false negatives). The following metrics are calculated as part of the classification report.

**Precision**

When it predicts yes, how often is it correct?
$$ Precision = \frac{TP}{TP \space + FP} = \frac{TP}{Total \space Predicted \space Positive} $$

**Recall**

When the outcome is actually _yes_, how often do we predict it as such?

$$ Recall = \frac{TP}{TP \space + FN} = \frac{TP}{Total \space Actual \space Positive}$$

**F1 Score**

The harmonic mean of precision and recall.

$$F_1 = 2 \times \frac {Precision \space \times \space Recall }{Precision \space + \space Recall }$$
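A short worked example may help to connect the three formulas. The counts below are hypothetical and refer to a single class treated as "positive"; the function name is illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for one class from its error counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for the "Pro" class: 80 hits, 20 false alarms, 40 misses.
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
```

Here precision is 80/100 = 0.8 and recall is 80/120, roughly 0.667. The F1 score of about 0.727 sits closer to the lower of the two values.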


### 7.1.1 Random Forest Classifier

In [None]:
print("Classification Report for Validation Dataset")
print(classification_report(y_val, rfc_pred))


### 7.1.2 Logistic Classifier

In [None]:
print(classification_report(y_val, lmc_pred, target_names=['Anti', 'Neutral','Pro','News']))


### 7.1.3 Decision Tree Classifier

In [None]:
print(classification_report(y_val, dtc_pred, target_names=['Anti', 'Neutral','Pro','News']))


### 7.1.4 Support vector machine Classifier

In [None]:
print(classification_report(y_val, svc_pred, target_names=['Anti', 'Neutral','Pro','News']))


## 7.2 Overall f1-score

In [None]:
# Random Forest
rfc_f1=f1_score(y_val, rfc_pred, average="macro")
# Logistic Model
lmc_f1=f1_score(y_val, lmc_pred, average="macro")
#Decision Tree
dtc_f1=f1_score(y_val, dtc_pred, average="macro")
#Support Vector Machine
svc_f1=f1_score(y_val, svc_pred, average="macro")

# 8. Submissions

Adding a sentiment column to our original test df

In [None]:
test['sentiment'] = svc_pred_t
test.head()

Creating an output csv for submission

In [None]:
test[['tweetid','sentiment']].to_csv('testsubmission.csv', index=False)

The base models did not perform well on the Kaggle leaderboard, most likely because they all used default hyperparameters. In the following sections, we look at ways to improve our models by tuning them. Model tuning customizes a model to the data so that it can generate more accurate predictions.

# Tuned and Improved Model

# 9. Data Cleaning

In an attempt to improve the models, we start from scratch with the data preprocessing, since the data cleaning process itself can also be optimized.

In [None]:
# Ignore warnings
import warnings
warnings.simplefilter(action='ignore')

# Install Prerequisites
# import sys
# import nltk
# !{sys.executable} -m pip install bs4 lxml wordcloud scikit-learn scikit-plot
# nltk.download('vader_lexicon')

# Exploratory Data Analysis
import re
import ast
import time
import nltk
import pickle
import numpy as np
import pandas as pd
import seaborn as sns
#from textblob import TextBlob
import matplotlib.pyplot as plt
#from wordcloud import WordCloud
from nltk.sentiment import SentimentIntensityAnalyzer

# Data Preprocessing
import string
from bs4 import BeautifulSoup
from collections import Counter
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from sklearn.utils import resample
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import TweetTokenizer 
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# Classification Models
from sklearn.svm import LinearSVC, SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Performance Evaluation
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize
from sklearn.model_selection import GridSearchCV
#from scikitplot.metrics import plot_roc, plot_confusion_matrix
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score, classification_report, confusion_matrix

# Display
%matplotlib inline
sns.set(font_scale=1)
sns.set_style("white")
from sklearn.metrics import plot_roc_curve

In [None]:
#train_data = pd.read_csv('train.csv')
#test_data = pd.read_csv('test.csv')
#train_data = pd.read_csv('/train.csv')
#test_data = pd.read_csv('/test.csv')
train_data = train.copy() #For EDA on raw data
test_data = test.copy()

In [None]:
train_data.head()

In [None]:
test_data.head()

In [None]:
# Final Cleaning
def sentiment_changer(df):
    """
    Change key words to reflect the general sentiment associated with it.
    """
    df['message'] = df['message'].apply(lambda x: x.replace('global', 'negative'))
    df['message'] = df['message'].apply(lambda x: x.replace('climate', 'positive'))
    df['message'] = df['message'].apply(lambda x: x.replace('MAGA', 'negative'))
    return df['message']

train_data['message'] = sentiment_changer(train_data)
test_data['message'] = sentiment_changer(test_data)

def clean(df):
    """
    Apply data cleaning steps to raw data.
    """
    df['token'] = df['message'].apply(TweetTokenizer().tokenize) ## first we tokenize
    df['punc'] = df['token'].apply(lambda x : [i for i in x if i not in string.punctuation])## remove punctuations
    df['dig'] = df['punc'].apply(lambda x: [i for i in x if i not in list(string.digits)]) ## remove digits
    df['final'] = df['dig'].apply(lambda x: [i for i in x if len(i) > 1]) ## remove all words with only 1 character
    return df['final']

train_data['final'] = clean(train_data)
test_data['final'] = clean(test_data)

### Resampling
We addressed the problem of imbalanced training data by resampling the data before building our models. A class size was determined based on the second largest sentiment class and other classes were either upsampled or downsampled according to the class size. However, resampling the data did not improve the performance of the models and we therefore excluded it.

## Lemmatisation

Lemmatisation removes inflectional word endings to return the base or dictionary form of a word, known as the "lemma". We used the WordNetLemmatizer() from nltk and passed it the most likely part of speech of each token.

In [None]:
import nltk
nltk.download('wordnet')

In [None]:
def get_part_of_speech(word):
    """
    Return the most likely WordNet part of speech ('n', 'v', 'a' or 'r') for a word.
    """
    probable_part_of_speech = wordnet.synsets(word) ## all WordNet synsets (senses) of the word
    pos_counts = Counter() ## instantiating our counter class
    pos_counts["n"] = len([i for i in probable_part_of_speech if i.pos()=="n"])
    pos_counts["v"] = len([i for i in probable_part_of_speech if i.pos()=="v"])
    pos_counts["a"] = len([i for i in probable_part_of_speech if i.pos()=="a"])
    pos_counts["r"] = len([i for i in probable_part_of_speech if i.pos()=="r"])
    most_likely_part_of_speech = pos_counts.most_common(1)[0][0] ## will extract the most likely part of speech from the list
    return most_likely_part_of_speech

normalizer = WordNetLemmatizer()

train_data['final'] = train_data['final'].apply(lambda x: [normalizer.lemmatize(token, get_part_of_speech(token)) for token in x])
test_data['final'] = test_data['final'].apply(lambda x: [normalizer.lemmatize(token, get_part_of_speech(token)) for token in x])
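One detail of `get_part_of_speech` is worth spelling out: tokens that WordNet does not know (hashtags, handles, misspellings) have no synsets, so every count is zero and the tie is broken by insertion order, which makes the function fall back to noun. The sketch below reproduces only that counting logic with hypothetical counts and a hypothetical helper name. It performs no WordNet lookup.

```python
from collections import Counter

def most_likely_pos(counts_by_pos):
    """Mirror the tie-breaking of Counter.most_common(1) on n/v/a/r counts."""
    pos_counts = Counter()
    for pos in ("n", "v", "a", "r"):  # insertion order decides ties
        pos_counts[pos] = counts_by_pos.get(pos, 0)
    return pos_counts.most_common(1)[0][0]

# Hypothetical synset counts per part of speech.
unknown_word = most_likely_pos({})              # e.g. a hashtag with no synsets
mostly_verb = most_likely_pos({"n": 1, "v": 4})
tie_verb_adj = most_likely_pos({"v": 2, "a": 2})
```

Defaulting to noun is a reasonable fallback here, since `WordNetLemmatizer.lemmatize` also treats words as nouns when no part of speech is given.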

## Split Training and Validation Sets

**Training data:** Data that contains a known label. The model is trained on this data so that it can generalize to unlabeled data.

**Validation data:** A subset of the training data that is held out and used to assess how well the algorithm was trained.

**Test data:** Data that is used to provide an unbiased evaluation of the final model fit on the training dataset.

In [None]:
X = train_data['final']
y = train_data['sentiment']
X_test = test_data['final']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1, random_state = 42)

# 10. Feature Extraction

The TfidfVectorizer transforms text to feature vectors that can be used as input to a classification model.
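To show what the weights mean, here is a minimal pure-Python sketch of the TF-IDF calculation with sublinear term frequency (`1 + ln(count)`), smoothed inverse document frequency (`ln((1 + n) / (1 + df)) + 1`) and L2 normalisation, following the formulas documented for scikit-learn's `TfidfVectorizer`. It is an illustration under simplifying assumptions: it splits on whitespace and ignores `min_df`, `max_df` and n-grams, and the example tweets are made up.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """TF-IDF with sublinear tf, smoothed idf and L2 normalisation per document."""
    tokenised = [doc.split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        raw = {t: (1 + math.log(c)) * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        vectors.append({t: v / norm for t, v in raw.items()})
    return vectors, idf

# Hypothetical, already-cleaned tweets.
docs = ["climate change is real", "climate hoax", "real science"]
vectors, idf = tfidf_vectors(docs)
```

In the second tweet, "hoax" receives a larger weight than "climate" because it appears in fewer documents, which is the behaviour that makes rarer, more distinctive words stand out.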

In [None]:
X_train = list(X_train.apply(' '.join))
X_val = list(X_val.apply(' '.join))

vectorizer = TfidfVectorizer(sublinear_tf=True, smooth_idf = True, max_df = 0.3, min_df = 5, ngram_range = (1, 2))
vectorizer.fit(X_train)

# vect_save_path = "TfidfVectorizer.pkl"
# with open(vect_save_path,'wb') as file:
#     pickle.dump(vectorizer,file)

X_train = vectorizer.transform(X_train)
X_val = vectorizer.transform(X_val)

# 11. Modelling

## logistic regression

Logistic regression is a statistical model that uses a logistic function to model a binary dependent variable. Multiclass classification can still be done through the one-vs-rest scheme, in which a separate model is trained for each class to predict whether an observation belongs to that class or not (making each one a binary classification problem).

In [None]:
modelstart = time.time()
logreg = LogisticRegression(C=1000, multi_class='ovr', solver='saga', random_state=42, max_iter=10)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_val)
logreg_f1 = round(f1_score(y_val, y_pred, average='weighted'),2)
logreg_precision = round(precision_score(y_val, y_pred, average='weighted'),4)
logreg_recall = round(recall_score(y_val, y_pred, average='weighted'),4)
print('Accuracy %s' % accuracy_score(y_pred, y_val))
print("Model Runtime: %0.2f seconds"%((time.time() - modelstart)))
report = classification_report(y_val, y_pred, output_dict=True)
results = pd.DataFrame(report).transpose()
results

## Multinomial Naive Bayes

The Multinomial Naive Bayes model estimates the conditional probability of a particular feature given a class and uses a multinomial distribution for each of the features. The model assumes that each feature makes an independent and equal contribution to the outcome.
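The idea can be sketched with raw token counts and Laplace smoothing (`alpha=1.0`, which is also the default of `MultinomialNB`). The helper names and the four training tweets are hypothetical, and the notebook itself feeds TF-IDF weights rather than raw counts into the model.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(documents, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods from token counts."""
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for tokens, label in zip(documents, labels):
        token_counts[label].update(tokens)
    vocab = {t for counts in token_counts.values() for t in counts}

    log_prior = {c: math.log(n / len(labels)) for c, n in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            t: math.log((token_counts[c][t] + alpha) / total) for t in vocab
        }
    return log_prior, log_likelihood

def predict_nb(tokens, log_prior, log_likelihood):
    """Pick the class with the highest log posterior; unseen tokens are skipped."""
    scores = {
        c: log_prior[c]
        + sum(log_likelihood[c][t] for t in tokens if t in log_likelihood[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Hypothetical tokenised tweets with sentiment labels.
train_docs = [["climate", "crisis", "real"], ["climate", "action"],
              ["hoax", "scam"], ["climate", "hoax"]]
train_labels = ["Pro", "Pro", "Anti", "Anti"]
prior, likelihood = train_multinomial_nb(train_docs, train_labels)
prediction = predict_nb(["hoax", "scam"], prior, likelihood)
```

Summing log likelihoods per token is where the independence assumption enters: each word contributes to the score without regard to the words around it.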

In [None]:
modelstart= time.time()
multinb = MultinomialNB()
multinb.fit(X_train, y_train)
y_pred = multinb.predict(X_val)
multinb_f1 = round(f1_score(y_val, y_pred, average='weighted'),2)
multinb_precision = round(precision_score(y_val, y_pred, average='weighted'),4)
multinb_recall = round(recall_score(y_val, y_pred, average='weighted'),4)
print('Accuracy %s' % accuracy_score(y_pred, y_val))
print("Model Runtime: %0.2f seconds"%((time.time() - modelstart)))
report = classification_report(y_val, y_pred, output_dict=True)
results = pd.DataFrame(report).transpose()
# results.to_csv("multinb_report.csv")
results

## Random Forest Classifier

Random forest models are an example of an ensemble method built on decision trees (i.e. they rely on aggregating the results of an ensemble of decision trees). Decision tree models represent data by partitioning it into different sections based on questions asked of the independent variables. Training data is placed at the root node and is then partitioned into smaller subsets which form the 'branches' of the tree. In random forest models, the trees are randomized and, for classification, the model combines the predictions of all the individual trees by voting (the mean prediction is used for regression).

In [None]:
modelstart = time.time()
rf = RandomForestClassifier(max_features=4, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_val)
rf_f1 = round(f1_score(y_val, y_pred, average='weighted'),2)
rf_precision = round(precision_score(y_val, y_pred, average='weighted'),4)
rf_recall = round(recall_score(y_val, y_pred, average='weighted'),4)
print('Accuracy %s' % accuracy_score(y_pred, y_val))
print("Model Runtime: %0.2f seconds"%((time.time() - modelstart)))
report = classification_report(y_val, y_pred, output_dict=True)
pd.DataFrame(report).transpose()

## Support Vector Classifier

A Support Vector Classifier is a discriminative classifier formally defined by a separating hyperplane. When labelled training data is passed to the model, also known as supervised learning, the algorithm outputs an optimal hyperplane which categorizes new data.

In [None]:
modelstart = time.time()
svc = SVC(gamma = 0.8, C = 10, random_state=42)
svc.fit(X_train, y_train)
y_pred = svc.predict(X_val)
svc_f1 = round(f1_score(y_val, y_pred, average='weighted'),2)
svc_precision = round(precision_score(y_val, y_pred, average='weighted'),4)
svc_recall = round(recall_score(y_val, y_pred, average='weighted'),4)
print('Accuracy %s' % accuracy_score(y_pred, y_val))
print("Model Runtime: %0.2f seconds"%((time.time() - modelstart)))
report = classification_report(y_val, y_pred, output_dict=True)
results = pd.DataFrame(report).transpose()
results

In [None]:
name = 'svm.pkl'

# Save the fitted SVC to disk so it can be reloaded without retraining
with open(name, 'wb') as file:
    pickle.dump(svc, file)

## Linear SVC


The objective of a Linear Support Vector Classifier is to return a "best fit" hyperplane that categorises the data. It is similar to SVC with the kernel parameter set to ’linear’, but it is implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and can scale better to large numbers of samples.

In [None]:
modelstart = time.time() 
linsvc = LinearSVC()
linsvc.fit(X_train, y_train)
y_pred = linsvc.predict(X_val)
linsvc_f1 = round(f1_score(y_val, y_pred, average='weighted'),2)
linsvc_precision = round(precision_score(y_val, y_pred, average='weighted'),4)
linsvc_recall = round(recall_score(y_val, y_pred, average='weighted'),4)
print('Accuracy %s' % accuracy_score(y_pred, y_val))
print("Model Runtime: %0.2f seconds"%((time.time() - modelstart)))
report = classification_report(y_val, y_pred, output_dict=True)
results = pd.DataFrame(report).transpose()
results

## K Neighbours Classifier

The K Neighbours Classifier is a classifier that implements the k-nearest neighbours vote. In classification, the output is a class membership. An object is classified by a plurality vote of its neighbours, with the object being assigned to the class most common among its k-nearest neighbours.
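The sketch below illustrates the plurality vote and why the choice of k matters. With k=1, a single mislabelled or unusual neighbour decides the prediction, while a larger k lets the surrounding points outvote it. The 2-D points are hypothetical, and Euclidean distance is assumed.

```python
import math
from collections import Counter

def knn_predict(query, points, labels, k=3):
    """Classify query by plurality vote among its k nearest training points."""
    ranked = sorted(zip(points, labels), key=lambda pair: math.dist(query, pair[0]))
    nearest_labels = [label for _, label in ranked[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical 2-D feature vectors; the last "Anti" point sits inside the "Pro" cluster.
train_points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (1.0, 1.0), (0.9, 1.1), (0.15, 0.15)]
train_labels = ["Pro", "Pro", "Pro", "Anti", "Anti", "Anti"]
query = (0.14, 0.16)
pred_k1 = knn_predict(query, train_points, train_labels, k=1)
pred_k3 = knn_predict(query, train_points, train_labels, k=3)
```

In this toy case the k=1 prediction follows the stray "Anti" point, whereas k=3 returns "Pro".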

In [None]:
modelstart = time.time()
kn = KNeighborsClassifier(n_neighbors=1)
kn.fit(X_train, y_train)
y_pred = kn.predict(X_val)
kn_f1 = round(f1_score(y_val, y_pred, average='weighted'),2)
kn_precision = round(precision_score(y_val, y_pred, average='weighted'),4)
kn_recall = round(recall_score(y_val, y_pred, average='weighted'),4)
print('accuracy %s' % accuracy_score(y_pred, y_val))
print("Model Runtime: %0.2f seconds"%((time.time() - modelstart)))
report = classification_report(y_val, y_pred, output_dict=True)
results = pd.DataFrame(report).transpose()
results

## Decision Tree Classifier

Decision tree machine learning models represent data by partitioning it into different sections based on questions asked of independent variables in the data. Training data is placed at the root node and is then partitioned into smaller subsets which form the 'branches' of the tree.

In [None]:
modelstart = time.time()
dt = DecisionTreeClassifier(random_state=42)    
dt.fit(X_train, y_train)
y_pred = dt.predict(X_val)
dt_f1 = round(f1_score(y_val, y_pred, average='weighted'),2)
dt_precision = round(precision_score(y_val, y_pred, average='weighted'),4)
dt_recall = round(recall_score(y_val, y_pred, average='weighted'),4)
print('accuracy %s' % accuracy_score(y_pred, y_val))
print("Model Runtime: %0.2f seconds"%((time.time() - modelstart)))
report = classification_report(y_val, y_pred, output_dict=True)
pd.DataFrame(report).transpose()

## AdaBoost Classifier

The AdaBoost classifier is an iterative ensemble method that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset. In the second step, the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases.
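The reweighting step can be sketched for the classic binary case with labels in {-1, +1}. This is an illustration only: scikit-learn's `AdaBoostClassifier` uses a multi-class variant of the algorithm, and the sketch assumes the weighted error lies strictly between 0 and 1. The labels and predictions are hypothetical.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One reweighting round for labels in {-1, +1}; returns (alpha, new_weights)."""
    # Weighted error: total weight of the misclassified samples.
    error = sum(w for w, y, p in zip(weights, y_true, y_pred) if y != p)
    # Classifier weight: larger when the weak learner makes fewer mistakes.
    alpha = 0.5 * math.log((1 - error) / error)
    # Shrink weights of correct samples, grow weights of incorrect ones.
    updated = [w * math.exp(-alpha * y * p) for w, y, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

# Hypothetical round: five equally weighted samples, one misclassified (index 4).
y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]
alpha, new_weights = adaboost_round([0.2] * 5, y_true, y_pred)
```

After one round the misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.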

In [None]:
modelstart = time.time()
ad = AdaBoostClassifier(random_state=42)
ad.fit(X_train, y_train)
y_pred = ad.predict(X_val)
ad_f1 = round(f1_score(y_val, y_pred, average='weighted'),2)
ad_precision = round(precision_score(y_val, y_pred, average='weighted'),4)
ad_recall = round(recall_score(y_val, y_pred, average='weighted'),4)
print('accuracy %s' % accuracy_score(y_pred, y_val))
print("Model Runtime: %0.2f seconds"%((time.time() - modelstart)))
report = classification_report(y_val, y_pred, output_dict=True)
pd.DataFrame(report).transpose()

# 12. Model Performance

### Performance Metrics of Best Models

We built and tested eight different classification models and compared their performance using the weighted F1 score, which averages the per-class F1 scores in proportion to the number of samples in each class. The F1 score itself is the harmonic mean of precision and recall, and it is the measure used to score our Kaggle output.

#### Precision

When it predicts "True", how often is it correct? 

$$ Precision = \frac{TP}{TP \space + FP} = \frac{TP}{Total \space Predicted \space Positive} $$

#### Recall

When the outcome is actually "True", how often do we predict it as such?

$$ Recall = \frac{TP}{TP \space + FN} = \frac{TP}{Total \space Actual \space Positive}$$

#### F1 Score

The harmonic mean of precision and recall.

$$F_1 = 2 \times \frac {Precision \space \times \space Recall }{Precision \space + \space Recall }$$
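Since the models here are compared on the weighted F1 score, it may help to see how it differs from the macro average. The per-class scores and class sizes below are hypothetical, and the two helper functions only mirror what `average='macro'` and `average='weighted'` mean in `f1_score`.

```python
def macro_f1(per_class_f1):
    """Unweighted mean of the per-class F1 scores."""
    return sum(per_class_f1.values()) / len(per_class_f1)

def weighted_f1(per_class_f1, support):
    """Mean of per-class F1 scores weighted by the number of true samples per class."""
    total = sum(support.values())
    return sum(per_class_f1[c] * support[c] for c in per_class_f1) / total

# Hypothetical per-class scores and class sizes for an imbalanced validation set.
f1_by_class = {"Anti": 0.40, "Neutral": 0.50, "Pro": 0.90, "News": 0.80}
support = {"Anti": 100, "Neutral": 200, "Pro": 600, "News": 100}
macro = macro_f1(f1_by_class)
weighted = weighted_f1(f1_by_class, support)
```

With these numbers the macro score is 0.65 and the weighted score is 0.76: the large, well-predicted Pro class lifts the weighted average, while the macro average exposes the weak minority classes.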

In [None]:
# Compare Weighted F1-Scores Between Models
fig,axis = plt.subplots(figsize=(10, 5))
model_names = ['Multinomial Naive Bayes','Logistic Regression','Random Forest Classifier','Support Vector Classifier','Linear SVC','K Neighbours Classifier','Decision Tree Classifier','AdaBoost Classifier']
model_f1_scores = [multinb_f1,logreg_f1,rf_f1,svc_f1,linsvc_f1,kn_f1,dt_f1,ad_f1]
ax = sns.barplot(x=model_names, y=model_f1_scores,palette='winter')
plt.title('Weighted F1-Score Per Classification Model',fontsize=14)
plt.xticks(rotation=90)
plt.ylabel('Weighted F1-Score')
for p in ax.patches:
    ax.text(p.get_x() + p.get_width()/2, p.get_y() + p.get_height(), round(p.get_height(),2), fontsize=12, ha="center", va='bottom')
    
plt.show()

From the performance metrics, we see that the **Support Vector Classifier** performed best on our validation set, closely followed by the **Linear SVC** and **Logistic Regression** models. The K Neighbours Classifier performed the worst by a clear margin, which may be due to the k value that was selected for the model. To get a robust measure of classifier performance, we apply cross validation and hyperparameter tuning to the top three models.

# 13. Hyperparameter Tuning of Best Models

**Cross validation** is a technique used to estimate how accurately a model predicts unseen data by repeatedly holding out part of the training data as a validation set. This is important because it helps to pick up issues such as over/underfitting and selection bias. We used the K-fold technique to perform cross validation.

**Hyperparameter tuning** is the process by which a set of ideal hyperparameters are chosen for a model. A hyperparameter is a parameter for which the value is set manually and tuned to control the algorithm's learning process.
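The two ideas combine in a grid search: every parameter combination is fitted once per fold, and the combination with the best average validation score wins. The sketch below only enumerates the folds and candidates to show how the amount of work grows. The helper names are hypothetical, and the folds are contiguous and unshuffled for simplicity, unlike the stratified folds that `GridSearchCV` uses for classifiers.

```python
from itertools import product

def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size

def expand_grid(param_grid):
    """List every parameter combination in a grid such as {'C': [1, 10], ...}."""
    names = sorted(param_grid)
    return [dict(zip(names, values)) for values in product(*(param_grid[n] for n in names))]

# Hypothetical grid with three values of C and two kernels.
folds = list(k_fold_indices(n_samples=12, k=5))
candidates = expand_grid({"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]})
n_fits = len(folds) * len(candidates)
```

Six candidates over five folds already require 30 model fits, which is a practical reason to keep a grid small once good values are known.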

#### Logistic Regression

In [None]:
LogisticRegression().get_params()

In [None]:
param_grid = {'C': [1000], #[100,1000]
              'max_iter': [10], #[10,100]
              'multi_class': ['ovr'], #['ovr', 'multinomial']
              'random_state': [42],
              'solver': ['saga']} #['saga','lbfgs']
grid_LR = GridSearchCV(LogisticRegression(), param_grid, scoring='f1_weighted', cv=5, n_jobs=-1)
grid_LR.fit(X_train, y_train)
y_pred = grid_LR.predict(X_val)
print("Best parameters:")
print(grid_LR.best_params_)
print('accuracy %s' % accuracy_score(y_pred, y_val))
print(classification_report(y_val, y_pred))

#### Linear SVC

In [None]:
LinearSVC().get_params()

In [None]:
param_grid = {'C': [100],#[0.1,1,10,100,1000]
              'max_iter': [10], #[10,100]
              'multi_class' : ['ovr'], #['crammer_singer', 'ovr']
              'random_state': [42]} 
grid_LSVC = GridSearchCV(LinearSVC(), param_grid, scoring='f1_weighted', cv=5, n_jobs=-1)
grid_LSVC.fit(X_train, y_train)
y_pred = grid_LSVC.predict(X_val)
print(grid_LSVC.best_params_)
print('accuracy %s' % accuracy_score(y_pred, y_val))
print(classification_report(y_val, y_pred))

#### Support Vector Classifier

In [None]:
SVC().get_params()

In [None]:
param_grid = {'C': [10],#[0.1,1,10,100,1000]
              'gamma': [0.8], #[0.8,1]
              'kernel': ['rbf'], #['linear','rbf']
              'random_state': [42]} 
grid_SVC = GridSearchCV(SVC(), param_grid, scoring='f1_weighted', cv=5, n_jobs=-1)
grid_SVC.fit(X_train, y_train)
y_pred = grid_SVC.predict(X_val)
print(grid_SVC.best_params_)
print('accuracy %s' % accuracy_score(y_pred, y_val))
print(classification_report(y_val, y_pred))

In [None]:
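# Macro F1 scores on the current validation split.
# The predictions from section 7.2 (rfc_pred, lmc_pred, ...) belong to the earlier
# 70/30 split and no longer match y_val, so each model predicts again here.
# Random Forest
rf_macro_f1 = f1_score(y_val, rf.predict(X_val), average="macro")
# Logistic Model
logreg_macro_f1 = f1_score(y_val, logreg.predict(X_val), average="macro")
# Decision Tree
dt_macro_f1 = f1_score(y_val, dt.predict(X_val), average="macro")
# Support Vector Machine
svc_macro_f1 = f1_score(y_val, svc.predict(X_val), average="macro")
# AdaBoost Classifier
ad_macro_f1 = f1_score(y_val, ad.predict(X_val), average="macro")
# K Neighbours Classifier
kn_macro_f1 = f1_score(y_val, kn.predict(X_val), average="macro")
# Linear SVC
linsvc_macro_f1 = f1_score(y_val, linsvc.predict(X_val), average="macro")
# Multinomial Naive Bayes
multinb_macro_f1 = f1_score(y_val, multinb.predict(X_val), average="macro")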

# 14. Model Analysis

We used a TF-IDF vectorizer to weight each token by its level of importance and turn each tweet into a feature vector, and we trained a support vector classifier (SVC) with a radial basis function (RBF) kernel. After some hyperparameter tuning, we found the following parameters to work well: {'C': 10, 'gamma': 0.8, 'kernel': 'rbf', 'random_state': 42}. A token pattern of alphanumeric words performed best and, since the average tweet has around 17 words, an n-gram range of 1 to 2 performed best at capturing semantic meaning. The SVC parameters were chosen because the RBF kernel performed better than a Linear SVC at separating the regions in which the different sentiment classes lie. This is possibly because the classification is not binary.
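For intuition about the `gamma` parameter, the RBF kernel scores the similarity of two vectors as `exp(-gamma * ||x - z||^2)`. The sketch below evaluates it with `gamma=0.8` on hypothetical 2-D vectors. Real TF-IDF vectors have thousands of dimensions, but the behaviour is the same.

```python
import math

def rbf_kernel(x, z, gamma=0.8):
    """Radial basis function similarity exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * squared_distance)

# Hypothetical 2-D vectors at increasing distance from [1.0, 0.0].
same = rbf_kernel([1.0, 0.0], [1.0, 0.0])
near = rbf_kernel([1.0, 0.0], [0.9, 0.1])
far = rbf_kernel([1.0, 0.0], [0.0, 1.0])
```

Similarity is 1 for identical vectors and decays towards 0 with distance. A larger `gamma` makes the decay faster, so each training tweet influences a smaller neighbourhood and the decision boundary can bend more.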

#### Performance Metrics

In [None]:
y_pred = svc.predict(X_val)
print('accuracy %s' % accuracy_score(y_pred, y_val))
print(classification_report(y_val, y_pred))

### Results

In [None]:
# Make prediction on test data
X = train_data['final']
y = train_data['sentiment']
X_test = test_data['final']

X = list(X.apply(' '.join))
X_test = list(X_test.apply(' '.join))

vectorizer = TfidfVectorizer(sublinear_tf=True, smooth_idf = True, max_df = 0.3, min_df = 5, ngram_range = (1, 2))
vectorizer.fit(X)

X = vectorizer.transform(X)
X_test = vectorizer.transform(X_test)

svc = SVC(gamma=0.8, C=10, random_state=42)
svc.fit(X, y)
y_test = svc.predict(X_test)

In [None]:
# Number of Tweets Per Sentiment Class
fig, axis = plt.subplots(ncols=2, figsize=(15, 5))

ax = sns.countplot(x=y_test,palette='winter',ax=axis[0])
axis[0].set_title('Number of Tweets Per Sentiment Class',fontsize=14)
axis[0].set_xlabel('Sentiment Class')
axis[0].set_ylabel('Tweets')
for p in ax.patches:
    ax.text(p.get_x() + p.get_width()/2., p.get_height(), '%d' % int(p.get_height()), fontsize=11, ha='center', va='bottom')

results = pd.DataFrame({"tweetid":test_data['tweetid'],"sentiment": y_test})
results['sentiment'].value_counts().plot.pie(autopct='%1.1f%%',colormap='winter_r',ax=axis[1])
axis[1].set_title('Proportion of Tweets Per Sentiment Class',fontsize=14)
axis[1].set_ylabel('Sentiment Class')
    
plt.show()

# 15. Save Output

In [None]:
# Create Kaggle Submission File
results = pd.DataFrame({"tweetid":test_data['tweetid'],"sentiment": y_test})
results.to_csv("Team2_final_submission.csv", index=False)

# 16. Conclusion

In this project, we succeeded in building a supervised machine learning model that classifies whether or not a person believes in climate change, based on novel tweet data. Our top performing model has a weighted F1 score of 0.78 on our validation set, and the results from our testing set are in line with what was observed in the training set. We think it is possible that the number of Pro tweets is related to the fact that "97% or more of actively publishing climate scientists agree: climate-warming trends over the past century are extremely likely due to human activities." ([NASA](https://climate.nasa.gov/scientific-consensus/#*))

**Impact investing** is an emerging field that refers to investments made into companies and organisations with the intention to generate measurable social or environmental impact alongside financial return. Many companies are built around lessening one’s environmental impact or carbon footprint and they offer products and services that are environmentally friendly and sustainable, in line with their values and ideals. These companies would like to determine how people perceive climate change and whether or not they believe it is a real threat. Our model provides a valuable solution to this problem and can add to their market research efforts in gauging how their product or service may be received. It gives companies access to a broad base of consumer sentiment, spanning multiple demographic and geographic categories, thus increasing their insights and informing future marketing strategies.

From our exploratory data analysis, we can draw some marketing-related insights. For maximum reach in marketing campaigns that target a specific group of people that have a certain stance with regard to climate change, a marketing team can consider the following:
  
The rise of impact investing has caused companies to focus on generating a positive social and environmental impact in addition to financial returns. Companies would benefit from aligning their brand and products with the Pro climate change movement. Pro climate change tweets tend to have a wider reach than other classes, so this is not only an ethical stance but also has the potential to increase the brand's exposure on Twitter. Companies could use their tweets to add their voice to the fight against global warming, even though such tweets may be worded with a negative or possibly neutral sentiment. This could extend their reach even further and also introduces other considerations, such as financial rewards due to carbon taxes.


Twitter hashtags are a powerful tool that companies can use to reach a larger audience. This can be achieved by engaging in conversations on climate change using the hashtags that Pro climate change users use. Doing so communicates the company's commitment to addressing man-made climate change. The image of a company plays a vital role in differentiating it from its competitors, as it communicates who the company caters to. This minor yet impactful act of using Pro climate change hashtags will improve the image of the business.

# 17. Comet

In [None]:
#Create dictionaries for the data we want to log
params={'random_state':7,
        'model_type':'lmc',
        'stratify':True
}
metrics = {'RFC_F1': rf_f1,
           'Logreg_F1': logreg_f1,
           'DTC_F1':dt_f1,
           'SVC_F1':svc_f1,
           'Multinb_F1':multinb_f1,
           'linsvc_F1':linsvc_f1,
           'Kn_F1':kn_f1,
           'AdaB_F1':ad_f1}
           
           # Recall
# metrics2={'Logreg_recall':logreg_recall,
#            'Multinb_recall':multinb_recall,
#            'RFC_recall': rf_recall,
#            'SVC_recall':svc_recall,
#            'linsvc_recall':linsvc_recall,
#            'kn_recall':kn_recall,
#            'DTC_recall':dt_recall,
#            'AdaB_recall':ad_recall}
#            # Precision
# metrics3={ 'Logreg_precision':logreg_precision,
#            'Multinb_precision':multinb_precision,
#            'RFC_precision':rf_precision,
#            'SVC_precision':svc_precision,
#            'Linsvc_precision':linsvc_precision,
#            'kn_precision':kn_precision,
#            'DTC_precision':dt_precision,
#            'AdaB_precision':ad_precision
# }

In [None]:
experiment.log_parameters(params)
experiment.log_metrics(metrics)
#experiment.log_metrics(metrics2)
#experiment.log_metrics(metrics3)

In [None]:
experiment.end()

# 18. References

1. https://towardsdatascience.com/tokenization-for-natural-language-processing-a179a891bad4
2. https://towardsdatascience.com/machine-learning-text-processing-1d5a2d638958