
# More to come. Stay tuned!!
If there are any changes you would like to see in this kernel, please let me know :). Feedback and constructive criticism on the notebook are greatly appreciated!

### <span style="color:green;">If you like it or it helps you, you can upvote and/or leave a comment :)</span>


![](https://miro.medium.com/max/1400/1*VT7AxioAGXplMe7RAEYfSA.png)


- <a href='#1'>1. Introduction</a>
- <a href='#2'>2. Retrieving the Data</a>
    - <a href='#2-1'>2.1 Load libraries</a>
    - <a href='#2-2'>2.2 Reading Data</a>
- <a href='#3'>3. Glimpse of Data</a>
    - <a href='#3-1'>3.1 Overview of tables</a>
    - <a href='#3-2'>3.2 Statistical overview of the Data</a>
- <a href='#4'>4. Check for missing data</a>
- <a href='#5'>5. Data Exploration</a>
    - <a href='#5-1'>5.1 Distribution for Text Length</a>
    - <a href='#5-2'>5.2 Distribution for Selected Text Length</a>
    - <a href='#5-3'>5.3 Word frequency in Text</a>
    - <a href='#5-4'>5.4 Word frequency in Selected Text</a>
    - <a href='#5-5'>5.5 Most Common Selected Text</a>
        - <a href='#5-5-1'>5.5.1 For categories</a>
- <a href='#6'>6. Sample Submission</a>

# <a id='1'>1. Introduction</a>

With so many tweets circulating every second, it is hard to tell whether the sentiment behind a specific tweet will help a company's or a person's brand by going viral (positive) or hurt profits because it strikes a negative tone. Capturing sentiment in language matters at a time when decisions and reactions are formed and updated in seconds. But which words actually lead to the sentiment label? In this competition you need to pick out the part of the tweet (a word or phrase) that reflects the sentiment.

Help build your skills in this important area with this broad dataset of tweets. Work on your technique to grab a top spot in this competition. What words in tweets support a positive, negative, or neutral sentiment? How can you help make that determination using machine learning tools?

# <a id='2'>2. Retrieving the Data</a>

## <a id='2-1'>2.1 Load libraries</a>

In [None]:
import pandas as pd # package for high-performance, easy-to-use data structures and data analysis
import numpy as np # fundamental package for scientific computing with Python
import matplotlib
import matplotlib.pyplot as plt # for plotting
import seaborn as sns # for making plots with seaborn
color = sns.color_palette()
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True) # enable plotly's offline notebook mode (one call is enough)
# cufflinks binds plotly to pandas; go_offline() makes it render offline
import cufflinks as cf
cf.go_offline()

# Venn diagram
from matplotlib_venn import venn2
import re
import nltk
from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string
eng_stopwords = stopwords.words('english')
import gc

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

In [None]:
import os
base_dr = "../input/tweet-sentiment-extraction"
print(os.listdir(base_dr))

## <a id='2-2'>2.2 Reading Data</a>

In [None]:
print('Reading data...')
train_data = pd.read_csv(base_dr+'/train.csv')
test_data = pd.read_csv(base_dr+'/test.csv')
sample_submission = pd.read_csv(base_dr+'/sample_submission.csv')
print('Reading data completed')

In [None]:
print('Size of train_data', train_data.shape)
print('Size of test_data', test_data.shape)
print('Size of sample_submission', sample_submission.shape)

# <a id='3'>3. Glimpse of Data</a>

## <a id='3-1'>3.1 Overview of tables</a>

In [None]:
display(train_data.head())
display(test_data.head())

## <a id='3-2'>3.2 Statistical overview of the Data</a>

In [None]:
display(train_data.describe())
display(test_data.describe())

# <a id='4'>4. Check for missing data</a>

In [None]:
# checking missing data in the training set: count and percentage of nulls per column
total = train_data.isnull().sum().sort_values(ascending = False)
percent = (train_data.isnull().sum()/train_data.isnull().count()*100).sort_values(ascending = False)
missing_train_data  = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_train_data.head()

In [None]:
# checking missing data in the test set: count and percentage of nulls per column
total = test_data.isnull().sum().sort_values(ascending = False)
percent = (test_data.isnull().sum()/test_data.isnull().count()*100).sort_values(ascending = False)
missing_test_data  = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_test_data.head()
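
The two cells above repeat the same steps for train and test, so they could be folded into one helper. Below is a minimal sketch of that idea; `missing_summary` and the toy frame are names I made up for illustration, not part of the competition data.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, most-missing first."""
    total = df.isnull().sum()
    percent = total / len(df) * 100
    summary = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    return summary.sort_values('Total', ascending=False)

# Toy frame standing in for train_data / test_data (hypothetical values)
toy = pd.DataFrame({
    'text': ['good day', None, 'so sad', 'ok'],
    'sentiment': ['positive', 'neutral', 'negative', 'neutral'],
})
summary = missing_summary(toy)
print(summary)
```

In the notebook this would be called as `missing_summary(train_data)` and `missing_summary(test_data)`.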

# <a id='5'>5. Data Exploration</a>

## <a id='5-1'>5.1 Distribution for Text Length</a>

In [None]:
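# character length of each tweet (rows with missing text are dropped)
train_text_len = train_data['text'].str.len().dropna()
test_text_len = test_data['text'].str.len().dropna()
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 6))
# histplot with kde=True replaces the deprecated sns.distplot
sns.histplot(train_text_len, kde=True, stat='density', ax=ax1, color='blue')
sns.histplot(test_text_len, kde=True, stat='density', ax=ax2, color='green')
ax1.set_title('Distribution in Training data')
ax2.set_title('Distribution in test data')
plt.show()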

## <a id='5-2'>5.2 Distribution for Selected Text Length</a>

In [None]:
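# selected_text exists only in the training data, so a single axis is enough
train_selected_len = train_data['selected_text'].str.len().dropna()
fig, ax1 = plt.subplots(figsize=(10, 6))
# histplot with kde=True replaces the deprecated sns.distplot
sns.histplot(train_selected_len, kde=True, stat='density', ax=ax1, color='blue')
ax1.set_title('Distribution in Training data')
plt.show()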

## <a id='5-3'>5.3 Word frequency in Text</a>

In [None]:
# training data
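# regex=True is required in recent pandas; the character class now keeps uppercase letters too (was 'a-za-z')
freq_dist = FreqDist([word for text in train_data['text'].str.replace(r'[^a-zA-Z0-9^,!.\/+-=]', ' ', regex=True) for word in str(text).split()])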
plt.figure(figsize=(20, 7))
plt.title('Word frequency (Training Data)').set_fontsize(25)
plt.xlabel('').set_fontsize(25)
plt.ylabel('').set_fontsize(25)
freq_dist.plot(60,cumulative=False)
plt.show()

# test data
freq_dist = FreqDist([word for text in test_data['text'] for word in str(text).split()])
plt.figure(figsize=(20, 7))
plt.title('Word frequency (Test Data)').set_fontsize(25)
plt.xlabel('').set_fontsize(25)
plt.ylabel('').set_fontsize(25)
freq_dist.plot(60,cumulative=False)
plt.show()

## <a id='5-4'>5.4 Word frequency in Selected Text</a>

In [None]:
# training data
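# regex=True is required in recent pandas; the character class now keeps uppercase letters too (was 'a-za-z')
freq_dist = FreqDist([word for text in train_data['selected_text'].str.replace(r'[^a-zA-Z0-9^,!.\/+-=]', ' ', regex=True) for word in str(text).split()])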
plt.figure(figsize=(20, 7))
plt.title('Word frequency (Training Data)').set_fontsize(25)
plt.xlabel('').set_fontsize(25)
plt.ylabel('').set_fontsize(25)
freq_dist.plot(60,cumulative=False)
plt.show()
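
Raw frequency plots like the ones above tend to be dominated by stopwords. Here is a rough sketch of counting words with stopwords filtered out. `top_words` and `TOY_STOPWORDS` are hypothetical names, and the tiny stopword set is only there so the example runs without the NLTK corpus; in this notebook `eng_stopwords` could be passed instead.

```python
import re
from collections import Counter

# Tiny illustrative stopword set (an assumption for this sketch);
# the notebook's eng_stopwords list from NLTK could be used in its place.
TOY_STOPWORDS = {'i', 'the', 'a', 'is', 'to', 'so', 'my', 'it'}

def top_words(texts, stopwords, n=5):
    """Count lowercase word tokens across texts, skipping stopwords."""
    counts = Counter()
    for text in texts:
        # keep runs of letters, digits and apostrophes as tokens
        tokens = re.findall(r"[a-z0-9']+", str(text).lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts.most_common(n)

# Toy tweets standing in for train_data['selected_text'] (hypothetical values)
toy_texts = ['I love the sun', 'so sad, I miss my dog', 'Love it!', 'miss the sun']
result = top_words(toy_texts, TOY_STOPWORDS, n=3)
print(result)
```

Lowercasing before counting merges "Love" and "love", which the plain `split()` used in the cells above does not do.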

## <a id='5-5'>5.5 Most Common Selected Text</a>

In [None]:
n = 10
train_data['selected_text'].value_counts()[:n].index.tolist()

### <a id='5-5-1'>5.5.1 For categories</a>

In [None]:
n = 3
train_data[train_data['sentiment']=='neutral']['selected_text'].value_counts()[:n].index.tolist()

In [None]:
n = 3
train_data[train_data['sentiment']=='positive']['selected_text'].value_counts()[:n].index.tolist()

In [None]:
n = 3
train_data[train_data['sentiment']=='negative']['selected_text'].value_counts()[:n].index.tolist()
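
The three cells above differ only in the sentiment label, so a single `groupby` can produce the same lists in one pass. This is just a sketch on toy data; `top_selected_by_sentiment` is a hypothetical helper name.

```python
import pandas as pd

def top_selected_by_sentiment(df: pd.DataFrame, n: int = 3) -> dict:
    """Map each sentiment label to its n most frequent selected_text values."""
    return {
        sentiment: group['selected_text'].value_counts().head(n).index.tolist()
        for sentiment, group in df.groupby('sentiment')
    }

# Toy frame standing in for train_data (hypothetical values)
toy = pd.DataFrame({
    'sentiment': ['positive', 'positive', 'positive', 'negative', 'negative', 'neutral'],
    'selected_text': ['good', 'good', 'love', 'miss', 'miss', 'I see'],
})
top = top_selected_by_sentiment(toy, n=1)
print(top)
```

In the notebook, `top_selected_by_sentiment(train_data, n=3)` should give the same three lists as the separate cells.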

# <a id='6'>6. Sample Submission</a>

In [None]:
test_data.head()

In [None]:
# constant baseline: predict one fixed phrase per sentiment
# ('good' for positive, 'I see' for neutral, 'miss' for negative)
test_data['selected_text'] = 'good'
test_data['selected_text'] = np.where(test_data['sentiment']=='neutral', 'I see', test_data['selected_text'])
test_data['selected_text'] = np.where(test_data['sentiment']=='negative', 'miss', test_data['selected_text'])
test_data.head(6)

In [None]:
# overwrite the constant baseline: predict the full tweet text instead
test_data['selected_text'] = test_data['text']
test_data.head(6)
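
To compare the two baselines above before submitting, they can be scored on rows where the true `selected_text` is known. The sketch below assumes the competition's word-level Jaccard score. Treating two empty strings as a perfect match is my own assumption, and the toy rows and helper names are made up for illustration.

```python
def jaccard(str1: str, str2: str) -> float:
    """Word-level Jaccard similarity between two strings (case-insensitive)."""
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    if not a and not b:
        return 1.0  # assumption: two empty strings count as identical
    c = a & b
    return len(c) / (len(a) + len(b) - len(c))

# Toy rows: (text, true selected_text, sentiment), hypothetical values
toy_rows = [
    ('I love this song', 'love', 'positive'),
    ('just got home', 'just got home', 'neutral'),
    ('I miss my dog so much', 'miss', 'negative'),
]

# Same constants as the constant baseline in this notebook
CONSTANT_GUESS = {'positive': 'good', 'neutral': 'I see', 'negative': 'miss'}

def mean_score(predict):
    """Average Jaccard of predict(text, sentiment) against the true selection."""
    scores = [jaccard(predict(text, sent), truth) for text, truth, sent in toy_rows]
    return sum(scores) / len(scores)

full_score = mean_score(lambda text, sent: text)                  # full-text baseline
const_score = mean_score(lambda text, sent: CONSTANT_GUESS[sent])  # constant baseline
print('full text:', round(full_score, 4), 'constant:', round(const_score, 4))
```

On the real data the same comparison would be run on `train_data`, since `test_data` has no true `selected_text`.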

In [None]:
sample = pd.read_csv("../input/tweet-sentiment-extraction/sample_submission.csv")
sample.loc[:, 'selected_text'] = test_data['selected_text']
sample.to_csv("submission.csv", index=False)
sample.head(6)
