# Tweet Sentiment Extraction


## 1. Introduction

* With all of the tweets circulating every second, it is hard to tell whether the sentiment behind a specific tweet will impact a company's, or a person's, brand by going viral (positive), or devastate profit because it strikes a negative tone. Capturing sentiment in language is important in these times, when decisions and reactions are created and updated in seconds. But which words actually lead to the sentiment description? In this competition you will need to pick out the part of the tweet (word or phrase) that reflects the sentiment.

* The goal of this competition is to extract the word or phrase that determines the sentiment of the whole tweet.

> Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive.



### 1.1 About Tweet Sentiment Extraction Dataset

The data folder contains the following files, all in CSV format:

**Files**
* `train.csv` - the training set
* `test.csv` - the test set
* `sample_submission.csv` - a sample submission file in the correct format

**Columns**
* `textID` - unique ID for each piece of text
* `text` - the text of the tweet
* `sentiment` - the general sentiment of the tweet
* `selected_text` - [train only] the text that supports the tweet's sentiment

### 1.3 Competition metric:

The metric in this competition is the word-level Jaccard score.
Jaccard similarity, or intersection over union, is defined as the size of the intersection divided by the size of the union of two sets. Let's take two sentences as an example:

`Sentence 1: AI is our friend and it has been friendly`

`Sentence 2: AI and humans have always been friendly`

To calculate the Jaccard similarity in this illustration, we first lemmatize the words to reduce them to a common root. Here, “friend” and “friendly” both become “friend”, and “has” and “have” both become “has”. Note that the competition metric itself applies no lemmatization; it lowercases each string and splits it on whitespace. Drawing a Venn diagram of the two sentences, we get:

![](https://miro.medium.com/max/926/1*NSK8ERXexyIZ_SRaxioFEg.png)

Please read [this](https://medium.com/@adriensieg/text-similarities-da019229c894) article for a better understanding.
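The arithmetic for the two example sentences can be written out as follows. This is a sketch that assumes only the two lemmatization merges named in the text are applied and all other words are left unchanged; the set names $A$ (Sentence 1) and $B$ (Sentence 2) are introduced here for illustration.

```latex
\begin{align*}
J(A, B) &= \frac{|A \cap B|}{|A \cup B|} \\
A &= \{\text{AI}, \text{is}, \text{our}, \text{friend}, \text{and}, \text{it}, \text{has}, \text{been}\} \\
B &= \{\text{AI}, \text{and}, \text{humans}, \text{has}, \text{always}, \text{been}, \text{friend}\} \\
|A \cap B| &= 5, \quad |A \setminus B| = 3, \quad |B \setminus A| = 2 \\
J(A, B) &= \frac{5}{5 + 3 + 2} = 0.5
\end{align*}
```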

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# !pip install chart_studio

import re
# Tutorial about Python regular expressions: https://pymotw.com/2/re/
import string
from pandas_profiling import ProfileReport
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from gensim.models import Word2Vec 
from gensim.models import KeyedVectors 
import matplotlib.pyplot as plt
import pickle
from tqdm import tqdm
import os
import nltk
import seaborn as sns
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from plotly import tools
# import chart_studio.plotly as py
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.figure_factory as ff

from collections import Counter

# suppress warnings
import warnings
warnings.filterwarnings("ignore")
sns.set(style="ticks", color_codes=True)


### 1.2 Reading Data

In [None]:
BASE_PATH = '../input/tweet-sentiment-extraction/'

train_df = pd.read_csv(BASE_PATH + 'train.csv')
test_df = pd.read_csv( BASE_PATH + 'test.csv')
submission_df = pd.read_csv( BASE_PATH + 'sample_submission.csv')

In [None]:
print("Number of data points in train data frame", train_df.shape)
print("The attributes of train data :", train_df.columns.values)
print('-'*50)
print("Number of data points in test data frame", test_df.shape)
print("The attributes of test data :", test_df.columns.values)

## 2. Profiling Dataframes

For profiling I am using the [pandas-profiling](https://github.com/pandas-profiling/pandas-profiling) library.


* **train profile report**

In [None]:
train_profile = ProfileReport(train_df, title='Train Data Profiling Report', html={'style':{'full_width':True}})

In [None]:
train_profile.to_file(output_file="train_profile.html")
train_profile.to_notebook_iframe()

> Note: train_df contains the following interesting information:
    * The text column does not have any duplicate values
    * There are only 2 missing values in the entire data frame, which is quite good
    * The sentiment column has 3 possible values: `positive`, `negative` or `neutral`

* **test profile report**

In [None]:
test_profile = ProfileReport(test_df, title='Test Data Profiling Report', html={'style':{'full_width':True}})

In [None]:
test_profile.to_file(output_file="test_profile.html")
test_profile.to_notebook_iframe()

> Note: test_df contains the following interesting information:
    * The text column does not have any duplicate values
    * There are no missing values in the test data frame
    * The sentiment column has 3 possible values: `positive`, `negative` or `neutral`

## 3. Understanding Competition Metric

We can define our own function for Jaccard similarity, or simply use the nltk library, which provides a predefined `jaccard_distance` function.

In [None]:
def jaccard_similarity(text1, text2):
    intersection = set(text1).intersection(set(text2))
    union = set(text1).union(set(text2))
    return len(intersection)/len(union)

In [None]:
str1 = 'President greets the press in Chicago'
str2 = 'Obama speaks in Illinois'

In [None]:
jaccard_similarity(str1, str2)

`jaccard_similarity = 1 - jaccard_distance`

In [None]:
nltk.jaccard_distance(set(str1), set(str2))

In [None]:
1 - nltk.jaccard_distance(set(str1), set(str2))

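* This shows that nltk gives exactly the same result as our implementation
* Because `set()` is applied to the raw strings, both calls compare sets of characters rather than sets of words
* str1 and str2 have a `jaccard_distance` of `0.6363636363636364` and a `jaccard_similarity` of `0.36363636363636365`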
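Calling `set()` on a Python string yields its characters, so the cells above score character overlap. A minimal sketch of how the same two strings score when they are first split into lowercased words, which is closer to a word-level metric. The helper name `jaccard` and the plain whitespace tokenization are assumptions made for this illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two iterables, each treated as a set."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty inputs have nothing to compare; treat them as identical.
    if not union:
        return 1.0
    return len(a & b) / len(union)


str1 = 'President greets the press in Chicago'
str2 = 'Obama speaks in Illinois'

# Character level: iterating a string yields one character at a time.
char_score = jaccard(str1, str2)

# Word level: lowercase, then split on whitespace.
word_score = jaccard(str1.lower().split(), str2.lower().split())

print(f"character-level: {char_score:.4f}")
print(f"word-level:      {word_score:.4f}")
```

The only word the two sentences share is `in`, so the word-level score is 1/9, much lower than the character-level value.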


## 4. Analyzing Train Data

### 4.1 distribution of train data

In [None]:
train_df.sentiment.value_counts()

In [None]:
sns.catplot(x="sentiment", kind="count", palette="ch:.25", data=train_df);

> Note: Our train data is not balanced: it has more neutral tweets than positive or negative ones.

### 4.2 distribution of test data

In [None]:
test_df.sentiment.value_counts()

In [None]:
sns.catplot(x="sentiment", kind="count", palette="ch:.25", data=test_df);

> Note: From the above plots it is clear that the train and test data have similar distributions.

### 4.3 Word clouds of `neutral`, `positive` and `negative` words

In [None]:
# https://www.datacamp.com/community/tutorials/wordcloud-python

def plot_wordcloud(text, mask=None, max_words=200, max_font_size=100, figure_size=(24.0,16.0), color = 'white',
                   title = None, title_size=40, image_color=False):
    stopwords = set(STOPWORDS)
    more_stopwords = {'one', 'br', 'Po', 'th', 'sayi', 'fo', 'Unknown'}
    stopwords = stopwords.union(more_stopwords)

    
    # Create a word cloud image
    wordcloud = WordCloud(background_color=color,
                   stopwords = stopwords,
                   max_words = max_words,
                   max_font_size = max_font_size,
                   random_state = 42,
                   mask=mask,
                   width=200,
                   height=100,
                   contour_width=2, 
                   contour_color='firebrick')
    
    wordcloud.generate(str(text))
    
    plt.figure(figsize=figure_size)
    if image_color:
        image_colors = ImageColorGenerator(mask);
        plt.imshow(wordcloud.recolor(color_func=image_colors), interpolation="bilinear");
        plt.title(title, fontdict={'size': title_size,  
                                  'verticalalignment': 'bottom'})
    else:
        plt.imshow(wordcloud);
        plt.title(title, fontdict={'size': title_size, 'color': 'black', 
                                  'verticalalignment': 'bottom'})
    plt.axis('off');
    plt.tight_layout()  

In [None]:
train_df = train_df.dropna()

def sentiment_corpus(sentiment):
    # Combine the train and test tweets of one sentiment into a single string.
    # plot_wordcloud() calls str(text), which would truncate a pandas Series.
    tweets = pd.concat([train_df.loc[train_df['sentiment'] == sentiment, 'text'],
                        test_df.loc[test_df['sentiment'] == sentiment, 'text']])
    return ' '.join(tweets.astype(str))

neutral_text = sentiment_corpus('neutral')
positive_text = sentiment_corpus('positive')
negative_text = sentiment_corpus('negative')


In [None]:
## util to create masked image compatible for WordCloud
wine_mask = np.array(Image.open("../input/wine-mask/wine_mask.png"))

# Transform your mask into a new one that will work with the function:
# pixels with value 0 become 255 (white); every other value is kept as is.
transformed_wine_mask = np.where(wine_mask == 0, 255, wine_mask).astype(np.int32)

In [None]:
plot_wordcloud(neutral_text, transformed_wine_mask, max_words=1000, max_font_size=120, title = 'Word Cloud of Neutral tweets', title_size=50)

In [None]:
plot_wordcloud(positive_text,transformed_wine_mask, max_words=1000, max_font_size=100, 
               title = 'Word Cloud of Positive tweets', title_size=50)

In [None]:
plot_wordcloud(negative_text,transformed_wine_mask, max_words=1000, max_font_size=100, 
               title = 'Word Cloud of Negative tweets', title_size=50)

In [None]:
def plot_text_features(data):
    
    fig = go.Figure()
    for val in data:
        fig.add_trace(go.Histogram(x=val['x'],name = val['label']))

    # Stack the histograms on top of each other
    fig.update_layout(barmode='stack')
    # Reduce opacity slightly
    fig.update_traces(opacity=0.75)
    fig.show()
    

### 4.4 Plotting number of words in text and selected_text

In [None]:
train_num_words = train_df['text'].apply(lambda x: len(str(x).split(' ')))
test_num_words = test_df['text'].apply(lambda x: len(str(x).split(' ')))
selected_text_num_words = train_df['selected_text'].apply(lambda x: len(str(x).split(' ')))


data_num_words = [
    {'x': train_num_words, 'label': 'Num of words in text of train data'},
    {'x': test_num_words, 'label': 'Num of words in text of test data'},
    {'x': selected_text_num_words, 'label': 'Num of words in selected text'},
]

plot_text_features(data_num_words)

> Observations:
    We can observe from the above histogram that the number of words in the train and test text ranges from 1 to 30. The number of words in the selected text mostly falls in the range 1-25.
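`str.split(' ')` and `str.split()` can give different word counts when a tweet has leading, trailing, or repeated spaces, because the former keeps empty strings. A small sketch on a made-up tweet; the sample string is hypothetical and not taken from the dataset.

```python
# Hypothetical tweet with a leading space, a double space, and a trailing space.
sample = " I'd  have responded, if I were going "

# Splits on every single space, so empty strings are kept.
tokens_single_space = sample.split(' ')

# Splits on runs of whitespace and drops empty strings.
tokens_whitespace = sample.split()

print(len(tokens_single_space), tokens_single_space)
print(len(tokens_whitespace), tokens_whitespace)
```

Which variant is appropriate depends on whether the counts should reflect raw spacing or actual words. The histograms above use `split(' ')`.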



### 4.5 Plotting number of characters in text and selected_text

In [None]:
train_num_chars = train_df['text'].apply(lambda x: len(x))
test_num_chars = test_df['text'].apply(lambda x: len(x))
selected_text_num_chars = train_df['selected_text'].apply(lambda x: len(x))


data_num_chars = [
    {'x': train_num_chars, 'label': 'Num of chars in text of train data'},
    {'x': test_num_chars, 'label': 'Num of chars in text of test data'},
    {'x': selected_text_num_chars, 'label': 'Num of chars in selected text'},
]

plot_text_features(data_num_chars)

> Observations:
    * From the above plot we can see that the number of characters in the test and train sets falls in the same range.
    * In the selected text, the number of characters ranges from 3 to 138.



### 4.6 Plotting number of unique words in text and selected_text

In [None]:
train_num_uniq_words = train_df['text'].apply(lambda x: len(set(str(x).split(' '))))
test_num_uniq_words = test_df['text'].apply(lambda x: len(set(str(x).split(' '))))
selected_text_num_uniq_words = train_df['selected_text'].apply(lambda x: len(set(str(x).split(' '))))


data_num_uniq_words = [
    {'x': train_num_uniq_words, 'label': 'Num of unique words in text of train data'},
    {'x': test_num_uniq_words, 'label': 'Num of unique words in text of test data'},
    {'x': selected_text_num_uniq_words, 'label': 'Num of unique words in selected text'},
]

plot_text_features(data_num_uniq_words)

> Observations:
    * We can see that the number of unique words in the train and test sets ranges from 1 to 30.
    * In the selected text, the number of unique words mostly lies between 1 and 30.



### 4.7 Number of stop words in text and selected_text

In [None]:
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 
stop_words = set(stopwords.words('english')) 


train_num_stop_words = train_df['text'].apply(lambda x: len([w for w in word_tokenize(x) if w in stop_words]))
test_num_stop_words = test_df['text'].apply(lambda x: len([w for w in word_tokenize(x) if w in stop_words]))
selected_text_num_stop_words = train_df['selected_text'].apply(lambda x: len([w for w in word_tokenize(x) if w in stop_words]))


data_num_stop_words = [
    {'x': train_num_stop_words, 'label': 'Num of stop words in text of train data'},
    {'x': test_num_stop_words, 'label': 'Num of stop words in text of test data'},
    {'x': selected_text_num_stop_words, 'label': 'Num of stop words in selected text'},
]

plot_text_features(data_num_stop_words)

> Observations:
    * In all text columns, the number of stop words mostly falls in the range 0-15.

### 4.8 Number of punctuations in text and selected_text

In [None]:
from string import punctuation



train_num_puncs = train_df['text'].apply(lambda x: len([w for w in word_tokenize(x) if w in punctuation]))
test_num_puncs = test_df['text'].apply(lambda x: len([w for w in word_tokenize(x) if w in punctuation]))
selected_text_num_puncs = train_df['selected_text'].apply(lambda x: len([w for w in word_tokenize(x) if w in punctuation]))


data_num_puncs = [
    {'x': train_num_puncs, 'label': 'Num of punctuation in text of train data'},
    {'x': test_num_puncs, 'label': 'Num of punctuation in text of test data'},
    {'x': selected_text_num_puncs, 'label': 'Num of punctuation in selected text'},
]

plot_text_features(data_num_puncs)


> Observations:
    * The number of punctuation marks varies from 0 to 100
    * Most of the values lie between 0 and 10
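In the cell above, `punctuation` is a plain string, so `w in punctuation` is a substring test: a multi-character token is counted only if it appears contiguously inside `string.punctuation`. An alternative sketch that counts individual punctuation characters using only the standard library. The function name `count_punctuation_chars` and the sample tweet are assumptions made for illustration.

```python
from string import punctuation

# Set of single ASCII punctuation characters for fast membership tests.
PUNCT_CHARS = set(punctuation)


def count_punctuation_chars(text):
    """Count the characters of `text` that are ASCII punctuation marks."""
    return sum(1 for ch in str(text) if ch in PUNCT_CHARS)


sample = "Wow... so good!!! :)"  # hypothetical tweet
print(count_punctuation_chars(sample))

# The substring test used above rejects repeated marks such as '!!'.
print('!!' in punctuation, '!' in punctuation)
```

Counting characters rather than tokens gives larger numbers for runs such as `...`, so the two approaches are not interchangeable when comparing histograms.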

### 4.9 Number of words in text by sentiment category

In [None]:
# Select the 'text' column on both sides and combine with pd.concat
# (Series.append is no longer available in recent pandas versions).
neutral_text = pd.concat([train_df.loc[train_df['sentiment'] == 'neutral', 'text'],
                          test_df.loc[test_df['sentiment'] == 'neutral', 'text']])
positive_text = pd.concat([train_df.loc[train_df['sentiment'] == 'positive', 'text'],
                           test_df.loc[test_df['sentiment'] == 'positive', 'text']])
negative_text = pd.concat([train_df.loc[train_df['sentiment'] == 'negative', 'text'],
                           test_df.loc[test_df['sentiment'] == 'negative', 'text']])

neutral_text_num_words = neutral_text.apply(lambda x: len(str(x).split(' ')))
positive_text_num_words = positive_text.apply(lambda x: len(str(x).split(' ')))
negative_text_num_words = negative_text.apply(lambda x: len(str(x).split(' ')))


data_num_words = [
    {'x': neutral_text_num_words, 'label': 'Num of words in neutral text'},
    {'x': positive_text_num_words, 'label': 'Num of words in positive text'},
    {'x': negative_text_num_words, 'label': 'Num of words in negative text'},
]

plot_text_features(data_num_words)



## 5. Training Model


Coming soon.

## 6. Simple Submission

In [None]:
# test_df['selected_text'] = test_df['text']
# test_df.loc[test_df.sentiment != 'neutral', 'selected_text'] = test_df.loc[test_df['sentiment'] != 'neutral', 'text'].apply(lambda x: " ".join(x.strip().split(' ')[-5:]))

submission_df['selected_text'] = test_df['text']
submission_df.to_csv("submission.csv", index=False)
display(submission_df.head(10))
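The commented-out lines in the cell above hint at a simple rule: keep the full tweet for neutral sentiment and only the last five words otherwise. A minimal sketch of that rule as a plain function. The name `baseline_selected_text` and the parameter `n_last` are hypothetical, the sample tweets are made up, and nothing here claims the rule scores well.

```python
def baseline_selected_text(text, sentiment, n_last=5):
    """Return the full tweet for neutral sentiment, else its last `n_last` words."""
    if sentiment == 'neutral':
        return text
    # Mirrors the commented-out line: strip, split on single spaces, keep the tail.
    return " ".join(text.strip().split(' ')[-n_last:])


# Hypothetical tweets, not taken from the dataset.
print(baseline_selected_text("what a lovely sunny day at the beach", "positive"))
print(baseline_selected_text("heading to work now", "neutral"))
```

With pandas, such a function could be applied row by row to `test_df` before writing the submission file.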

A little help was taken from this kernel:

* https://www.kaggle.com/ratan123/sentiment-extraction-understanding-metric-eda


Note: If you like my work, please, upvote ☺