# Sentiment Analysis of Health Tweets -- Data Wrangling

## Introduction

This notebook walks through the data cleaning step of this project.

Specifically, I'll be walking through:

1. **Regular data cleaning**
   - NA data
   - Duplicates
   - Change the type of some columns
2. **Text data cleaning** 
   - Make text all lower case
   - Remove URLs
   - Remove punctuation
   - Remove common nonsensical text (\n)
   - Remove emoji
   - Remove stopwords


## Problem Statement
Health is one of the most important things in our lives, but different people have different concerns about their health.

The goal of this project is to find out which health topics people are most concerned about, and to apply sentiment analysis to gauge how negatively they feel about their health and how much time they spend on it.

## Loading the data


In [1]:
# load Python packages
# (the bare `import datetime` was dropped: `from datetime import datetime` below shadowed it anyway)
import os
import pandas as pd
import time
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import re
import string
from datetime import datetime
from nltk.corpus import stopwords
import nltk
from collections import Counter
%matplotlib inline

In [2]:
df_1 = pd.read_csv('data/20201110_220931_health_tweets.csv')
df_1.head()

Unnamed: 0,user_name,user_description,user_location,tweetID,date,text
0,mirasenthirajah,,,1323776974372831232,2020-11-03 23:59:59,anxiety on x games mode 😳
1,HonniMX,Monsta X💙\nOnlyMbb👑❤\n14-5-19\n💙Cuenta dedicad...,,1323776974129684484,2020-11-03 23:59:59,@OfficialMonstaX\nOh i'm sorry did i make...
2,truenene,📣||JESUS IS COMING BACK FOR REAL📣|| WWE & Mess...,Southpark,1323776973685010435,2020-11-03 23:59:59,The way some of you dey see your body for this...
3,Valcore_,Avid soysauce drinker | Blender Artist | PFP b...,"Florida, USA",1323776973240487938,2020-11-03 23:59:59,@ewjulii Count the ways Funtime Freddy doing t...
4,JoanaBCT,"Mbb Ot7, Monwenee Forever 🐶🐹🐻🐢🐝 🐺🐰, 🧡💛💚💙💜🖤♥️",Monbebe 💖,1323776972615462912,2020-11-03 23:59:59,@MX_7Mention @OfficialMonstaX @official__wonho...


In [3]:
df_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67500 entries, 0 to 67499
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   user_name         67500 non-null  object
 1   user_description  59744 non-null  object
 2   user_location     46512 non-null  object
 3   tweetID           67500 non-null  int64 
 4   date              67500 non-null  object
 5   text              67500 non-null  object
dtypes: int64(1), object(5)
memory usage: 3.1+ MB


There are null values in user_description and user_location, but this project doesn't need these features. Let's select the features I will be working with.
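The null counts can also be computed directly instead of being read off `info()`. The following is a minimal sketch on a small made-up frame (the `toy` data and the `complete_cols` name are assumptions for illustration, not part of the original notebook):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_1 (hypothetical values, not the real tweets)
toy = pd.DataFrame({
    "user_name": ["a", "b", "c"],
    "user_location": ["Florida, USA", np.nan, np.nan],
    "text": ["t1", "t2", "t3"],
})

# Count missing values per column
null_counts = toy.isna().sum()
print(null_counts)

# List the columns that have no missing values at all
complete_cols = null_counts[null_counts == 0].index.tolist()
print(complete_cols)
```

On the real data, the same `isna().sum()` call on `df_1` should agree with the non-null counts reported by `info()` above.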

In [4]:
# select the needed columns; .copy() avoids a SettingWithCopyWarning when df is modified later
needed_columns = ['user_name', 'date', 'text']
df = df_1[needed_columns].copy()
df.head()

Unnamed: 0,user_name,date,text
0,mirasenthirajah,2020-11-03 23:59:59,anxiety on x games mode 😳
1,HonniMX,2020-11-03 23:59:59,@OfficialMonstaX\nOh i'm sorry did i make...
2,truenene,2020-11-03 23:59:59,The way some of you dey see your body for this...
3,Valcore_,2020-11-03 23:59:59,@ewjulii Count the ways Funtime Freddy doing t...
4,JoanaBCT,2020-11-03 23:59:59,@MX_7Mention @OfficialMonstaX @official__wonho...


### Check the duplicate rows
As the output below shows, there are many duplicate rows, so I need to remove them.

In [5]:
duplicateRowsDF = df[df.duplicated()]
duplicateRowsDF

Unnamed: 0,user_name,date,text
2500,mirasenthirajah,2020-11-03 23:59:59,anxiety on x games mode 😳
2501,HonniMX,2020-11-03 23:59:59,@OfficialMonstaX\nOh i'm sorry did i make...
2502,truenene,2020-11-03 23:59:59,The way some of you dey see your body for this...
2503,Valcore_,2020-11-03 23:59:59,@ewjulii Count the ways Funtime Freddy doing t...
2504,JoanaBCT,2020-11-03 23:59:59,@MX_7Mention @OfficialMonstaX @official__wonho...
...,...,...,...
67495,rashidbinnur,2020-11-03 23:51:45,Ugh I'm simping
67496,AntiWokeBritain,2020-11-03 23:51:45,Ferguson/ICL model premise: interventions simp...
67497,MaconBibb,2020-11-03 23:51:45,Commissioners approve agreement with @NewtownM...
67498,MaryRarick,2020-11-03 23:51:44,@brat2381 My mouth is watering. I have ham hoc...


In [6]:
# Remove the duplicate rows
df = df.drop_duplicates()
df = df.reset_index(drop=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20784 entries, 0 to 20783
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   user_name  20784 non-null  object
 1   date       20784 non-null  object
 2   text       20784 non-null  object
dtypes: object(3)
memory usage: 487.2+ KB
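The row count dropped from 67500 to 20784, so 46716 rows were exact duplicates. A quick way to count duplicates before dropping them is sketched below on made-up data (the `toy` frame and variable names are assumptions for illustration):

```python
import pandas as pd

# Toy frame with two repeats of the first row (hypothetical values)
toy = pd.DataFrame({
    "user_name": ["a", "b", "a", "a"],
    "date": ["2020-11-03", "2020-11-03", "2020-11-03", "2020-11-03"],
    "text": ["hello", "hi", "hello", "hello"],
})

# duplicated() flags every row that repeats an earlier row
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence; reset_index renumbers the rows
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduped))
```

By default both `duplicated()` and `drop_duplicates()` compare all columns and keep the first occurrence, which matches how they are used in this notebook.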


### Change the type of some columns

In [7]:
# encode user names as integer category codes
df['user_name'] = df['user_name'].astype('category').cat.codes

# keep only the calendar date (drops the time of day)
df['date'] = pd.to_datetime(df['date']).dt.date
df.head()

Unnamed: 0,user_name,date,text
0,15317,2020-11-03,anxiety on x games mode 😳
1,3333,2020-11-03,@OfficialMonstaX\nOh i'm sorry did i make...
2,18410,2020-11-03,The way some of you dey see your body for this...
3,8294,2020-11-03,@ewjulii Count the ways Funtime Freddy doing t...
4,3949,2020-11-03,@MX_7Mention @OfficialMonstaX @official__wonho...


In [8]:
df.shape

(20784, 3)
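To make the two conversions easier to follow, here is a small sketch on made-up data (the `toy` frame and the `user_id` and `day` column names are assumptions for illustration). Category codes are assigned in sorted order of the distinct values, and `.dt.date` yields plain Python `date` objects rather than a `datetime64` column:

```python
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    "user_name": ["zoe", "adam", "zoe"],
    "date": ["2020-11-03 23:59:59", "2020-11-03 23:51:45", "2020-11-04 00:00:01"],
})

# Distinct names are sorted, then numbered: adam -> 0, zoe -> 1
toy["user_id"] = toy["user_name"].astype("category").cat.codes

# Parse the timestamp strings, then keep only the calendar date
toy["day"] = pd.to_datetime(toy["date"]).dt.date

print(toy[["user_id", "day"]])
```

Because the codes depend on the set of names present, they are only stable for a given dataset; re-running on different data may assign different integers to the same user.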

## Data cleaning
Data cleaning steps on all text:
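- Make text all lower case
- Remove URLs
- Remove punctuation
- Remove words containing numbers (listed as a step, but not performed by `clean_text` below; tokens such as `brat2381` remain in the output)
- Remove common nonsensical text (\n)
- Remove emoji
- Remove stopwords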

In [9]:
# create emoji pattern
emoji_pattern = re.compile(pattern = "["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # box drawing through misc. symbols and arrows
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"  # NOTE: very wide range; also spans CJK, Hangul and other non-emoji scripts
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"  # all supplementary planes
        u"\u2640-\u2642"
        u"\u2600-\u2B55"
        u"\u200d"  # zero width joiner
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # variation selector-16
        u"\u3030"
                           "]+", flags = re.UNICODE)

# Get the stopwords (the NLTK corpus file is named 'english', lowercase;
# the corpus must be available, e.g. via nltk.download('stopwords'))
stop_words = set(stopwords.words('english'))

def clean_text(text):
    """Apply the text cleaning steps to a single tweet."""
    text = re.sub(r'https?://\S+', '', text) # remove URLs (http and https)
    text = text.lower() # make text lowercase
    text = ' '.join([word for word in text.split() if word not in stop_words]) # remove stopwords
    text = text.translate(str.maketrans('','',string.punctuation)) # remove punctuation
    text = emoji_pattern.sub(r'', text) # remove emoji
    text = re.sub('\n', '', text) # remove any remaining newlines (split() above already collapses them)
    
    return text

clean = lambda x: clean_text(x)

In [10]:
# Let's take a look at the updated text
df['text'] = df.text.apply(clean)
df.text

0                                    anxiety x games mode 
1                 officialmonstax oh im sorry make anxious
2          way dey see body app top deɛ asɛ mo nta mpo kai
3        ewjulii count ways funtime freddy whip nae nae...
4        mx7mention officialmonstax officialwonho love ...
                               ...                        
20779                                       ugh im simping
20780    fergusonicl model premise interventions simply...
20781    brat2381 mouth watering ham hocks freezer time...
20782    ugh hate waterbug roaches get outside know the...
20783    neilclark66 witty clearly unwell  blackmailed ...
Name: text, Length: 20784, dtype: object
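One detail worth keeping in mind is the order of the steps: `clean_text` filters stopwords before it strips punctuation, so a stopword with punctuation attached (such as `it,`) is not recognized as a stopword. The sketch below illustrates the effect with a tiny hand-picked stopword set (an assumption for illustration, not the NLTK list) and hypothetical helper names; how much this matters on the real tweets has not been checked here.

```python
import string

STOP = {"i", "it", "but", "the", "is"}  # tiny illustrative set, NOT the NLTK list
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def drop_stopwords(text):
    """Remove whitespace-separated tokens that are in STOP."""
    return " ".join(w for w in text.split() if w not in STOP)

def strip_punctuation(text):
    """Delete all ASCII punctuation characters."""
    return text.translate(PUNCT_TABLE)

sample = "i hate it, but the anxiety is real."

# Same order as clean_text: stopwords first, punctuation second
stopwords_first = strip_punctuation(drop_stopwords(sample))

# Alternative order: punctuation first, stopwords second
punctuation_first = drop_stopwords(strip_punctuation(sample))

print(stopwords_first)    # "it" survives because it was "it," during filtering
print(punctuation_first)
```

The alternative order has its own trade-off: stripping apostrophes first turns contractions such as `don't` into `dont`, which a stopword list containing `don't` would then miss.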

Let's stop cleaning for now. We can come back to do more cleaning if the results don't make sense. 
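One candidate for that later round is the "remove words containing numbers" step, since handles such as `brat2381` are still present in the cleaned text. A minimal sketch of how that step could look (the function name `drop_tokens_with_digits` is hypothetical and not part of the original notebook):

```python
import re

def drop_tokens_with_digits(text):
    """Remove whitespace-separated tokens that contain at least one digit."""
    return " ".join(tok for tok in text.split() if not re.search(r"\d", tok))

sample = "brat2381 mouth watering ham hocks freezer"
cleaned = drop_tokens_with_digits(sample)
print(cleaned)
```

If adopted, it could be applied after `clean_text`, for example with `df['text'].apply(drop_tokens_with_digits)`.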


### Export data to a new csv file 

In [11]:
# save the clean data for later use (index=False avoids an extra "Unnamed: 0" column on reload)
df.to_csv('data/clean_data.csv', index=False)
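Whether the row index is written to the file affects how the CSV reads back. The sketch below round-trips a made-up frame through in-memory buffers to show the difference (the `toy` data and variable names are assumptions for illustration):

```python
import io
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({"user_name": [1, 2], "text": ["anxiety mode", "ugh"]})

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

When a file was already saved with the index, passing `index_col=0` to `pd.read_csv` is another way to avoid the extra column.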