# Hate Speech Sentiment Analysis of Twitter Data

This notebook is my attempt at following Prateek Joshi's tutorial, [Comprehensive Hands on Guide to Twitter Sentiment Analysis with dataset and code](https://www.analyticsvidhya.com/blog/2018/07/hands-on-sentiment-analysis-dataset-python/).

- The train and test data come from [this Kaggle dataset](https://www.kaggle.com/datasets/dv1453/twitter-sentiment-analysis-analytics-vidya?resource=download).

In [10]:
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import string
import nltk
import warnings

warnings.filterwarnings("ignore", category=DeprecationWarning)
%matplotlib inline
train = pd.read_csv('./data/train_tweets.csv')
test = pd.read_csv('./data/test_tweets.csv')

- We read the train and test data and store them in `train` and `test` respectively.

- `%matplotlib inline` embeds Matplotlib plots directly in the notebook's output cells, so each plot is displayed below the code that generated it instead of opening in a separate window.

- Next, let us view the first 5 rows of the train data using `train.head()`

In [11]:
train.head()

| | id | label | tweet |
|---|---|---|---|
| 0 | 1 | 0 | @user when a father is dysfunctional and is s... |
| 1 | 2 | 0 | @user @user thanks for #lyft credit i can't us... |
| 2 | 3 | 0 | bihday your majesty |
| 3 | 4 | 0 | #model i love u take with u all the time in ... |
| 4 | 5 | 0 | factsguide: society now #motivation |


## Data Cleansing

We'll now proceed to do some data cleansing.

Initial data cleaning requirements that we can think of after looking at the top 5 records:

- The Twitter handles are already masked as `@user` due to privacy concerns, so they carry hardly any information about the nature of the tweet.
- We can also get rid of punctuation, numbers, and special characters, since they wouldn’t help in differentiating different kinds of tweets.
- Most short words do not add much value, for example ‘pdx’, ‘his’, ‘all’, so we will try to remove them from our data as well.
- Once we have carried out the three steps above, we can split every tweet into individual words, or tokens, which is an essential step in any NLP task.
- In the 4th tweet, there is a word ‘love’. We might also have terms like loves, loving, lovable, etc. in the rest of the data. These terms are often used in the same context. If we can reduce them to their root word, which is ‘love’, then we can reduce the total number of unique words in our data without losing a significant amount of information.
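
To make the list above concrete, here is a minimal sketch of the first four steps applied to a single string, using only the standard library. The function name `clean_tweet`, the `min_len` parameter, the sample tweet, and the choice to keep `#` so that hashtags survive are all assumptions made for illustration; they are not taken from the tutorial's code. Stemming (the fifth step) is left out here because it needs an external library.

```python
import re

def clean_tweet(tweet, min_len=4):
    """Apply the first four cleaning steps to one tweet (illustrative only)."""
    # Step 1: drop masked handles such as "@user".
    text = re.sub(r"@\w+", "", tweet)
    # Step 2: replace everything except letters and '#' with a space
    # (keeping '#' is an assumption, so hashtags stay recognisable).
    text = re.sub(r"[^a-zA-Z#]", " ", text)
    # Steps 3 and 4: split on whitespace and keep only tokens that are
    # at least min_len characters long.
    return [tok for tok in text.split() if len(tok) >= min_len]

# Hypothetical sample tweet, loosely modelled on the rows shown earlier.
sample = "@user thanks for #lyft credit, i can't use it 2day"
print(clean_tweet(sample))
```

With `min_len=4`, three-letter words such as ‘his’ and ‘all’ are dropped, which matches the examples given in the list.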

### Removing Twitter Handles (@user)

In [20]:
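# Combine train and test so every cleaning step is applied to both at once.
combi = pd.concat([train, test], ignore_index=True)

def remove_pattern(input_txt, pattern):
    # Find every substring of input_txt that matches the regex pattern.
    r = re.findall(pattern, input_txt)
    print(r)  # debug output: the list of matches
    for i in r:
        # Escape each match so it is removed literally, not re-read as a regex.
        input_txt = re.sub(re.escape(i), '', input_txt)

    return input_txt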


['b']


'@acd'
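
As a rough illustration of how a helper like `remove_pattern` can strip handles, the sketch below runs a self-contained variant over two short strings. The pattern `@\w*`, the `tweets` list, and the final `.strip()` are assumptions chosen for this example, and this variant escapes each match before substitution; it is not necessarily the exact call used in the tutorial.

```python
import re

def remove_pattern(input_txt, pattern):
    # Remove every substring that matches `pattern`; each match is escaped
    # so it is treated as literal text during substitution.
    for match in re.findall(pattern, input_txt):
        input_txt = re.sub(re.escape(match), "", input_txt)
    return input_txt

# Two short example strings based on the rows shown by train.head().
tweets = ["@user @user thanks for #lyft credit", "bihday your majesty"]

# "@\w*" matches an "@" followed by any run of word characters.
cleaned = [remove_pattern(t, r"@\w*").strip() for t in tweets]
print(cleaned)
```

Strings without a handle pass through unchanged, so the same call can be applied to every row of the combined data.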