# Twitter Sentiment Analysis

This notebook introduces the basics of sentiment analysis in Python.

Competition: [Kaggle Competition](https://www.kaggle.com/c/twitter-sentiment-analysis2/)

Getting started with the Kaggle API: [API](https://github.com/Kaggle/kaggle-api)

To learn how the Naive Bayes classifier works: [Jurafsky](https://web.stanford.edu/~jurafsky/slp3/4.pdf)

Other interesting NLP datasets: [NLP Datasets](https://lionbridge.ai/datasets/top-20-twitter-datasets-for-natural-language-processing-and-machine-learning/)

To keep up with the latest research: [NLP Progress](https://nlpprogress.com/)

We will cover:

0. [Install Packages & Download Data](#0.-Install-Packages-&-Download-Data)
1. [Importing Libraries](#1.-Importing-Libraries)
2. [Loading Data](#2.-Loading-Data)
3. [Exploratory Data Analysis](#3.-EDA)
4. [Data Preprocessing](#4.-Data-Preprocessing)
5. [Model Training](#5.-Model-Training)
6. [Model Evaluation](#6.-Model-Evaluation)

See the [Kaggle notebooks on this challenge](https://www.kaggle.com/crowdflower/twitter-airline-sentiment/notebooks) for more solutions and examples of feature extraction.

## 0. Install Packages & Download Data

In [19]:
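# Uncomment to download the dataset and install the plotting dependencies.
# The loading step expects the archive to be extracted into twitter-airline-sentiment/.
# !kaggle datasets download crowdflower/twitter-airline-sentiment
# !pip install wordcloud
# !pip install seaborn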

## 1. Importing Libraries

In [18]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS

# Silence library warnings to keep the notebook output readable.
import warnings
warnings.filterwarnings('ignore')

## 2. Loading Data

In [13]:
data = pd.read_csv('twitter-airline-sentiment/Tweets.csv')
data.head()

|  | tweet_id | airline_sentiment | airline_sentiment_confidence | negativereason | negativereason_confidence | airline | airline_sentiment_gold | name | negativereason_gold | retweet_count | text | tweet_coord | tweet_created | tweet_location | user_timezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 570306133677760513 | neutral | 1.0 |  |  | Virgin America |  | cairdin |  | 0 | @VirginAmerica What @dhepburn said. |  | 2015-02-24 11:35:52 -0800 |  | Eastern Time (US & Canada) |
| 1 | 570301130888122368 | positive | 0.3486 |  | 0.0 | Virgin America |  | jnardino |  | 0 | @VirginAmerica plus you've added commercials t... |  | 2015-02-24 11:15:59 -0800 |  | Pacific Time (US & Canada) |
| 2 | 570301083672813571 | neutral | 0.6837 |  |  | Virgin America |  | yvonnalynn |  | 0 | @VirginAmerica I didn't today... Must mean I n... |  | 2015-02-24 11:15:48 -0800 | Lets Play | Central Time (US & Canada) |
| 3 | 570301031407624196 | negative | 1.0 | Bad Flight | 0.7033 | Virgin America |  | jnardino |  | 0 | @VirginAmerica it's really aggressive to blast... |  | 2015-02-24 11:15:36 -0800 |  | Pacific Time (US & Canada) |
| 4 | 570300817074462722 | negative | 1.0 | Can't Tell | 1.0 | Virgin America |  | jnardino |  | 0 | @VirginAmerica and it's a really big bad thing... |  | 2015-02-24 11:14:45 -0800 |  | Pacific Time (US & Canada) |


## 3. EDA

In [14]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14640 entries, 0 to 14639
Data columns (total 15 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   tweet_id                      14640 non-null  int64  
 1   airline_sentiment             14640 non-null  object 
 2   airline_sentiment_confidence  14640 non-null  float64
 3   negativereason                9178 non-null   object 
 4   negativereason_confidence     10522 non-null  float64
 5   airline                       14640 non-null  object 
 6   airline_sentiment_gold        40 non-null     object 
 7   name                          14640 non-null  object 
 8   negativereason_gold           32 non-null     object 
 9   retweet_count                 14640 non-null  int64  
 10  text                          14640 non-null  object 
 11  tweet_coord                   1019 non-null   object 
 12  tweet_created                 14640 non-null  object 
 13  t
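
The `data.info()` output shows that several columns are sparsely populated (for example, `airline_sentiment_gold` has 40 non-null values out of 14640). A natural next step is to quantify the missing values and check how the sentiment labels are distributed. The sketch below is one possible way to do that with pandas. `summarize_labels` is a hypothetical helper, and the `toy` frame stands in for `Tweets.csv` so the example runs on its own.

```python
import pandas as pd


def summarize_labels(df: pd.DataFrame, label_col: str = "airline_sentiment"):
    """Return (share of each label, fraction of missing values per column)."""
    # Share of rows per sentiment class, as fractions that sum to 1.
    class_share = df[label_col].value_counts(normalize=True)
    # Fraction of missing entries per column, most sparse first.
    missing_share = df.isna().mean().sort_values(ascending=False)
    return class_share, missing_share


# Toy rows standing in for Tweets.csv (assumption: same column names).
toy = pd.DataFrame({
    "airline_sentiment": ["negative", "negative", "neutral", "positive"],
    "negativereason": ["Bad Flight", "Can't Tell", None, None],
    "text": ["a", "b", "c", "d"],
})

class_share, missing_share = summarize_labels(toy)
print(class_share)
print(missing_share)
```

On the real data, `summarize_labels(data)` would be the equivalent call. The class shares could then be passed to a bar plot, for instance with the `seaborn` import from section 1.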

## 4. Data Preprocessing
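
The notebook does not show its preprocessing code here. As a minimal sketch of what this step often involves for tweets, the example below lowercases the text, strips URLs and `@mentions`, and drops punctuation. `clean_tweet` and the regular expressions are assumptions made for illustration, not the original author's pipeline.

```python
import re

# Patterns are illustrative assumptions, not taken from the original notebook.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def clean_tweet(text: str) -> str:
    """Normalize a raw tweet into lowercase, space-separated word tokens."""
    text = text.lower()
    text = URL_RE.sub(" ", text)         # drop links
    text = MENTION_RE.sub(" ", text)     # drop @handles such as @VirginAmerica
    text = text.replace("'", "")         # keep contractions whole: didn't -> didnt
    text = NON_LETTER_RE.sub(" ", text)  # drop digits and punctuation
    return " ".join(text.split())        # collapse repeated whitespace


print(clean_tweet("@VirginAmerica What @dhepburn said."))
```

Applied to the loaded frame, this could look like `data["clean_text"] = data["text"].apply(clean_tweet)`, where `clean_text` is a hypothetical column name.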

## 5. Model Training
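
The introduction points to the Naive Bayes classifier, so a small dependency-free sketch of multinomial Naive Bayes with add-one smoothing may help make the idea concrete. The function names, the whitespace tokenization, and the toy training sentences are all assumptions for illustration. In practice a library implementation such as scikit-learn's `MultinomialNB` would more likely be used.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Fit multinomial Naive Bayes with add-one (Laplace) smoothing."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # word frequencies per class
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = doc.split()
        word_counts[label].update(tokens)
        vocab.update(tokens)

    # log P(c): fraction of training documents in each class.
    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    # log P(w | c) with add-one smoothing over the shared vocabulary.
    log_likelihood = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + 1) / total) for w in vocab
        }
    return log_prior, log_likelihood, vocab


def predict(model, doc):
    """Return the class with the highest posterior log-score for doc."""
    log_prior, log_likelihood, vocab = model
    scores = {}
    for c in log_prior:
        # Words never seen in training are ignored.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc.split() if w in vocab
        )
    return max(scores, key=scores.get)


# Toy training set (made up for this example, not taken from Tweets.csv).
train_docs = [
    "flight delayed again",
    "terrible service lost bag",
    "great crew thanks",
    "love the new seats",
]
train_labels = ["negative", "negative", "positive", "positive"]

model = train_naive_bayes(train_docs, train_labels)
print(predict(model, "lost bag and delayed"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the add-one term keeps a word that never appeared in a class from forcing that class's probability to zero.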

## 6. Model Evaluation
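
With three sentiment classes, accuracy alone can hide weak performance on a less frequent class, so per-class precision and recall are usually worth reporting as well. The sketch below computes these from lists of true and predicted labels. `evaluate` is a hypothetical helper, and the label lists are toy values rather than results from this notebook.

```python
from collections import Counter


def evaluate(y_true, y_pred):
    """Return accuracy, per-class precision/recall, and confusion counts."""
    # confusion[(true_label, predicted_label)] = number of such examples
    confusion = Counter(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

    report = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = confusion[(c, c)]
        predicted = sum(n for (t, p), n in confusion.items() if p == c)
        actual = sum(n for (t, p), n in confusion.items() if t == c)
        report[c] = {
            "precision": tp / predicted if predicted else 0.0,
            "recall": tp / actual if actual else 0.0,
        }
    return accuracy, report, confusion


# Toy labels for illustration only.
y_true = ["negative", "negative", "neutral", "positive"]
y_pred = ["negative", "neutral", "neutral", "positive"]

accuracy, report, confusion = evaluate(y_true, y_pred)
print(accuracy)
print(report)
```

Here one negative tweet is misclassified as neutral, which lowers recall for `negative` and precision for `neutral` while leaving `positive` unaffected.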