<h1 style="text-align:center">Covid-19 Tweet Analysis</h1>

<div style="text-align:center;"><img src="https://images.unsplash.com/photo-1592499879835-3a1691ab26be?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1650&q=80" /></div>

**Context:** 
> Perform text classification on the data. The tweets were pulled from Twitter and then tagged manually.
> The names and usernames have been replaced with codes to avoid any privacy concerns.

**About the Data:**

1) Location   
2) Tweet At   
3) Original Tweet   
4) Label (stored in the `Sentiment` column)


# Imports

In [None]:
# Data Processing
import numpy as np 
import pandas as pd 
import re

# Data Visualization
import matplotlib.pyplot as plt
%matplotlib inline

import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.colors import n_colors
from plotly.subplots import make_subplots

import missingno as msno


import seaborn as sns
sns.set(style='whitegrid')

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# Exploratory Data Analysis

In [None]:
train = pd.read_csv("/kaggle/input/covid-19-nlp-text-classification/Corona_NLP_train.csv", encoding='latin-1')
test = pd.read_csv("/kaggle/input/covid-19-nlp-text-classification/Corona_NLP_test.csv")

In [None]:
train

## Target Value - Sentiment

In [None]:
plt.figure(figsize=(15,5))
b = sns.countplot(x='Sentiment', data=train, order=['Extremely Negative', 'Negative', 'Neutral', 'Positive', 'Extremely Positive'])
b.set_title("Sentiment Distribution");

The plot shows that Positive is the most common sentiment class and Extremely Negative is the least common.
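
The class balance can also be read off as numbers rather than from the plot. This is a minimal sketch, assuming a small hypothetical `sample` DataFrame in place of `train`. The rows are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for `train`; only the 'Sentiment' column matters here.
sample = pd.DataFrame({
    "Sentiment": ["Positive", "Negative", "Positive", "Neutral",
                  "Extremely Positive", "Positive", "Extremely Negative", "Negative"]
})

order = ["Extremely Negative", "Negative", "Neutral", "Positive", "Extremely Positive"]

# Absolute counts per class, in the same order as the count plot.
counts = sample["Sentiment"].value_counts().reindex(order, fill_value=0)

# Relative share of each class, useful for judging class imbalance.
shares = (counts / counts.sum()).round(3)

summary = pd.DataFrame({"count": counts, "share": shares})
print(summary)
```

Running the same lines on `train` instead of `sample` would give the exact counts behind the figure.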

## Missing Values

In [None]:
train.isna().sum()

In [None]:
msno.matrix(train);

We only have missing values for `Location`.
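
It is often useful to know what share of a column is missing, not only that it has gaps. The following is a sketch on a hypothetical `sample` frame with invented values. Only the `isna().mean()` idiom is the point.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `train` with gaps only in 'Location'.
sample = pd.DataFrame({
    "Location": ["London", np.nan, "UK", np.nan],
    "OriginalTweet": ["a", "b", "c", "d"],
    "Sentiment": ["Positive", "Neutral", "Negative", "Positive"],
})

# isna() gives a boolean frame; the column mean is the fraction of missing rows.
missing_pct = sample.isna().mean().mul(100).round(1)
print(missing_pct)
```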

## User Name

In [None]:
train['UserName'].nunique()

There are as many unique `UserName` values as there are tweets.

## ScreenName

In [None]:
train['ScreenName'].nunique()

Again, the number of unique `ScreenName` values equals the number of tweets.
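
The comparison against the row count can be made explicit. This sketch uses a hypothetical `sample` frame with made-up codes. If both columns turn out to be pure row identifiers in `train` as well, they probably carry no signal for classification, though that is an assumption worth verifying.

```python
import pandas as pd

# Hypothetical stand-in for `train`; the codes are invented.
sample = pd.DataFrame({
    "UserName": [1001, 1002, 1003],
    "ScreenName": [5001, 5002, 5003],
})

# A column identifies rows uniquely when its distinct-value count equals the row count.
is_row_id = {
    col: bool(sample[col].nunique() == len(sample))
    for col in ["UserName", "ScreenName"]
}
print(is_row_id)
```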

In [None]:
train.head()

## OriginalTweet

### Preprocessing

In [None]:
# Remove URLs

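def remove_urls(text):
    """Strip anything that looks like scheme://host/path from the text."""
    return re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', text)

# Keep the raw tweet and store the cleaned version in a new column
train['Content'] = train['OriginalTweet'].apply(remove_urls)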

In [None]:
# Remove HTML tags

def remove_html(text):
    """Strip anything enclosed in angle brackets, e.g. <br> or <a href=...>."""
    return re.sub(r'<.*?>', '', text)

train['Content'] = train['Content'].apply(remove_html)
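
To see what the two substitutions do, they can be run on a single string. This sketch reuses the same regular expressions as the cells above. The helper `clean_tweet` and the example text are hypothetical and exist only for illustration.

```python
import re

# Same patterns as in the preprocessing cells, compiled once for reuse.
URL_PATTERN = re.compile(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*')
HTML_PATTERN = re.compile(r'<.*?>')

def clean_tweet(text):
    """Remove URLs first, then HTML tags (hypothetical helper combining both steps)."""
    text = URL_PATTERN.sub('', text)
    text = HTML_PATTERN.sub('', text)
    return text

# Invented example, not a tweet from the dataset.
example = "Stock up <b>now</b> https://t.co/abc123 #covid"
cleaned = clean_tweet(example)
print(repr(cleaned))
```

Note that removing a URL leaves the surrounding whitespace in place, so cleaned tweets can contain double spaces. Whether to collapse them is a separate decision.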

### Word Cloud

**Which words appear the most in tweets?**

In [None]:
fig, (ax) = plt.subplots(1,1,figsize=[15, 10])
wc = WordCloud(width=600,height=400, background_color='white', colormap="Greys").generate(" ".join(train['Content']))

ax.imshow(wc,interpolation='bilinear')
ax.axis('off')
ax.set_title('Wordcloud of Tweets');

**Which words appear the most in tweets with an extremely positive sentiment?**

In [None]:
fig, (ax) = plt.subplots(1,1,figsize=[15, 10])
wc = WordCloud(width=600,height=400, background_color='white', colormap="Greens").generate(" ".join(train['Content'][train['Sentiment'] == 'Extremely Positive']))

ax.imshow(wc,interpolation='bilinear')
ax.axis('off')
ax.set_title('Wordcloud of Tweets with Extremely Positive Sentiment');

**Which words appear the most in tweets with a positive sentiment?**

In [None]:
fig, (ax) = plt.subplots(1,1,figsize=[15, 10])
wc = WordCloud(width=600,height=400, background_color='white', colormap="Blues").generate(" ".join(train['Content'][train['Sentiment'] == 'Positive']))

ax.imshow(wc,interpolation='bilinear')
ax.axis('off')
ax.set_title('Wordcloud of Tweets with Positive Sentiment');

**Which words appear the most in tweets with a neutral sentiment?**

In [None]:
fig, (ax) = plt.subplots(1,1,figsize=[15, 10])
wc = WordCloud(width=600,height=400, background_color='white', colormap="Purples").generate(" ".join(train['Content'][train['Sentiment'] == 'Neutral']))

ax.imshow(wc,interpolation='bilinear')
ax.axis('off')
ax.set_title('Wordcloud of Tweets with Neutral Sentiment');

**Which words appear the most in tweets with a negative sentiment?**

In [None]:
fig, (ax) = plt.subplots(1,1,figsize=[15, 10])
wc = WordCloud(width=600,height=400, background_color='white', colormap="Oranges").generate(" ".join(train['Content'][train['Sentiment'] == 'Negative']))

ax.imshow(wc,interpolation='bilinear')
ax.axis('off')
ax.set_title('Wordcloud of Tweets with Negative Sentiment');

**Which words appear the most in tweets with an extremely negative sentiment?**

In [None]:
fig, (ax) = plt.subplots(1,1,figsize=[15, 10])
wc = WordCloud(width=600,height=400, background_color='white', colormap="Reds").generate(" ".join(train['Content'][train['Sentiment'] == 'Extremely Negative']))

ax.imshow(wc,interpolation='bilinear')
ax.axis('off')
ax.set_title('Wordcloud of Tweets with Extremely Negative Sentiment');
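
Word clouds give a visual impression, but the word sizes are hard to compare exactly. A frequency table is a possible numeric counterpart. This is a sketch only: the helper `top_words`, the tiny stop-word set `STOP`, and the sample texts are all hypothetical. In the notebook the `STOPWORDS` set imported from `wordcloud` could be used instead.

```python
from collections import Counter
import re

# Tiny illustrative stop-word set (an assumption, not the list WordCloud uses).
STOP = {"the", "to", "a", "and", "of", "in", "is"}

def top_words(texts, n=3):
    """Return the n most frequent lowercase words across texts, ignoring stop words."""
    tokens = []
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        tokens.extend(w for w in words if w not in STOP)
    return Counter(tokens).most_common(n)

# Invented texts, not taken from the dataset.
sample_texts = [
    "The store is out of toilet paper",
    "Toilet paper panic in the store",
    "Panic buying of food",
]
print(top_words(sample_texts, n=4))
```

Passing `train['Content'][train['Sentiment'] == 'Negative']` instead of `sample_texts` would give the per-class ranking that corresponds to each cloud.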

### Tweet Length

In [None]:
# Get tweet length (number of characters in the cleaned text)

train['TweetLength'] = train['Content'].str.len()

In [None]:
b = sns.boxplot(y = 'TweetLength', data = train)
b.set_title("TweetLength Distribution");

In [None]:
# Filter the rows first, then refer to the column by name
b = sns.boxplot(y='TweetLength', data=train[train['Sentiment'] == 'Extremely Positive'])
b.set_title("TweetLength Distribution for Extremely Positive Sentiment");

In [None]:
plt.figure(figsize=(15,5))
b = sns.boxplot(y='TweetLength', x='Sentiment', data=train, order=['Extremely Negative', 'Negative', 'Neutral', 'Positive', 'Extremely Positive']);
b.set_title("TweetLength Distribution for Sentiment");
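
The box plots can be backed with summary statistics per class. This is a minimal sketch on a hypothetical `sample` frame with invented texts. The column names follow the notebook.

```python
import pandas as pd

# Hypothetical stand-in for `train` after preprocessing.
sample = pd.DataFrame({
    "Content": ["ok", "panic buying everywhere", "stay safe", "shelves are empty again"],
    "Sentiment": ["Neutral", "Negative", "Positive", "Negative"],
})

# Character count of each cleaned tweet.
sample["TweetLength"] = sample["Content"].str.len()

# Median, minimum and maximum length for each sentiment class.
stats = sample.groupby("Sentiment")["TweetLength"].agg(["median", "min", "max"])
print(stats)
```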

## Location

In [None]:
train['Country'] = train['Location'].str.split(',').str[-1]
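
The split keeps whatever follows the last comma, which has a few edge cases worth seeing on examples. The sketch below uses invented location strings. The extra `.str.strip()` is an assumption added here to trim the space that usually follows a comma.

```python
import numpy as np
import pandas as pd

# Invented examples covering typical free-text location entries.
locations = pd.Series(["London, England", "UK", np.nan, "Austin, TX "])

# Take the text after the last comma and trim surrounding whitespace.
# Entries without a comma are returned unchanged; missing values stay missing.
country = locations.str.split(',').str[-1].str.strip()
print(country.tolist())
```

The last element is not always a country (it can be a state code such as `TX`), so the resulting column would likely need further normalisation before use.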

# **To be continued...**