# Sentiment Analysis of Health Tweets -- Data Wrangling

## Introduction

This notebook walks through the data cleaning step of the project.

Specifically, I'll be walking through:

1. **Regular data cleaning**
   - Handle NA data
   - Remove duplicates
   - Change the type of some columns
2. **Text data cleaning**
   - Make all text lower case
   - Remove URLs
   - Remove punctuation
   - Remove common nonsensical text (\n)
   - Remove emoji (pattern drafted but currently commented out)
   - Remove stopwords


## Problem Statement
Health is one of the most important things in our lives, but different people have different health concerns.

The goal of this project is to find out which health topics people are most concerned about, and to apply sentiment analysis to gauge how negatively they feel about their health and how much time they spend on it.

## Loading the data


In [1]:
# Load Python packages
import os
import pandas as pd
import time
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import re
import string
from datetime import datetime
from nltk.corpus import stopwords
import nltk
from collections import Counter
%matplotlib inline

In [2]:
# Load all datasets
df_1 = pd.read_csv('data/20201110_220931_health_tweets.csv')
df_2 = pd.read_csv('data/20201111_065943_health_tweets.csv')
df_3 = pd.read_csv('data/20201111_143013_health_tweets.csv')
df_4 = pd.read_csv('data/20201112_030746_health_tweets.csv')
df_5 = pd.read_csv('data/20201112_113022_health_tweets.csv')
df_6 = pd.read_csv('data/20201112_230533_health_tweets.csv')
df_7 = pd.read_csv('data/20201113_071344_health_tweets.csv')

In [3]:
# Concatenate all datasets into one DataFrame
frames = [df_1, df_2, df_3, df_4, df_5, df_6, df_7] 
df = pd.concat(frames)
df.head()

Unnamed: 0,user_name,user_description,user_location,tweetID,date,text
0,mirasenthirajah,,,1323776974372831232,2020-11-03 23:59:59,anxiety on x games mode 😳
1,HonniMX,Monsta X💙\nOnlyMbb👑❤\n14-5-19\n💙Cuenta dedicad...,,1323776974129684484,2020-11-03 23:59:59,@OfficialMonstaX\nOh i'm sorry did i make...
2,truenene,📣||JESUS IS COMING BACK FOR REAL📣|| WWE & Mess...,Southpark,1323776973685010435,2020-11-03 23:59:59,The way some of you dey see your body for this...
3,Valcore_,Avid soysauce drinker | Blender Artist | PFP b...,"Florida, USA",1323776973240487938,2020-11-03 23:59:59,@ewjulii Count the ways Funtime Freddy doing t...
4,JoanaBCT,"Mbb Ot7, Monwenee Forever 🐶🐹🐻🐢🐝 🐺🐰, 🧡💛💚💙💜🖤♥️",Monbebe 💖,1323776972615462912,2020-11-03 23:59:59,@MX_7Mention @OfficialMonstaX @official__wonho...
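The seven files above share a naming pattern, so loading and concatenating them could also be done with one helper. This is a minimal sketch, not the code that produced the output above: `load_tweet_files` is a hypothetical name, and the glob pattern in the usage comment assumes the same `data/` folder layout.

```python
import glob

import pandas as pd


def load_tweet_files(pattern):
    """Read every CSV matching `pattern` and stack them into one DataFrame."""
    paths = sorted(glob.glob(pattern))  # sorted so the row order is reproducible
    if not paths:
        raise FileNotFoundError(f"no files match {pattern!r}")
    frames = [pd.read_csv(path) for path in paths]
    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True)


# Hypothetical usage, assuming the folder layout used in this notebook:
# df = load_tweet_files('data/*_health_tweets.csv')
```

With `ignore_index=True` the combined frame does not carry repeated index labels from the individual files, so no separate `reset_index` is needed for that purpose.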


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 472500 entries, 0 to 67499
Data columns (total 6 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   user_name         472500 non-null  object
 1   user_description  414194 non-null  object
 2   user_location     322606 non-null  object
 3   tweetID           472500 non-null  int64 
 4   date              472500 non-null  object
 5   text              472500 non-null  object
dtypes: int64(1), object(5)
memory usage: 25.2+ MB


There are null values in user_description and user_location, but this project doesn't need those features. Let's select the features I will be working with.

In [5]:
# let's select the needed columns
needed_columns = ['user_name', 'date', 'text']
df = df[needed_columns]
df.head()

Unnamed: 0,user_name,date,text
0,mirasenthirajah,2020-11-03 23:59:59,anxiety on x games mode 😳
1,HonniMX,2020-11-03 23:59:59,@OfficialMonstaX\nOh i'm sorry did i make...
2,truenene,2020-11-03 23:59:59,The way some of you dey see your body for this...
3,Valcore_,2020-11-03 23:59:59,@ewjulii Count the ways Funtime Freddy doing t...
4,JoanaBCT,2020-11-03 23:59:59,@MX_7Mention @OfficialMonstaX @official__wonho...
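A quick way to confirm that the kept columns contain no missing values is to count nulls per column. The sketch below uses a small made-up frame (`toy`) rather than the real tweet data, so the counts it prints are illustrative only.

```python
import pandas as pd

# Toy frame standing in for the tweet data (values are made up)
toy = pd.DataFrame({
    'user_name': ['a', 'b', 'c'],
    'user_location': ['Florida, USA', None, None],
    'text': ['x', 'y', 'z'],
})

# Count missing values in each column
na_counts = toy.isna().sum()
print(na_counts)

# True only if every kept column is fully populated
needed = ['user_name', 'text']
kept_complete = bool(toy[needed].notna().all().all())
print(kept_complete)
```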


### Check the duplicate rows
The check below shows that there are a lot of duplicate rows, so I need to delete them.

In [6]:
duplicateRowsDF = df[df.duplicated()]
duplicateRowsDF

Unnamed: 0,user_name,date,text
2500,mirasenthirajah,2020-11-03 23:59:59,anxiety on x games mode 😳
2501,HonniMX,2020-11-03 23:59:59,@OfficialMonstaX\nOh i'm sorry did i make...
2502,truenene,2020-11-03 23:59:59,The way some of you dey see your body for this...
2503,Valcore_,2020-11-03 23:59:59,@ewjulii Count the ways Funtime Freddy doing t...
2504,JoanaBCT,2020-11-03 23:59:59,@MX_7Mention @OfficialMonstaX @official__wonho...
...,...,...,...
67494,thepotatsaIad,2020-11-09 23:56:33,@CireFlame @nytimes https://t.co/SK9hpYIHsb i...
67495,MadnessTouch,2020-11-09 23:56:33,Imma be honest I literally know 3or 4 mufos. I...
67496,rob_stayton,2020-11-09 23:56:33,"""Today is a really historic day in the history..."
67497,australiawhisky,2020-11-09 23:56:33,@doorsausage @profesterman On balance though N...


In [7]:
# Remove the duplicate rows
df = df.drop_duplicates()
df = df.reset_index(drop=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146046 entries, 0 to 146045
Data columns (total 3 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   user_name  146046 non-null  object
 1   date       146046 non-null  object
 2   text       146046 non-null  object
dtypes: object(3)
memory usage: 3.3+ MB


Here we got a large number of duplicate rows, probably because a tweet that contains several keywords was extracted multiple times. After dropping the duplicates, 146,046 of the original 472,500 rows remain, which should be enough for the analysis.
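To see how `duplicated` and `drop_duplicates` behave, here is a small sketch on made-up rows. By default a row counts as a duplicate only if every column matches an earlier row, and the first occurrence is kept.

```python
import pandas as pd

# Made-up rows: the third row repeats the first one exactly
toy = pd.DataFrame({
    'user_name': ['a', 'b', 'a', 'a'],
    'date': ['2020-11-03', '2020-11-03', '2020-11-03', '2020-11-04'],
    'text': ['hi', 'hi', 'hi', 'hi'],
})

# duplicated() marks rows that repeat an earlier row in all columns
n_duplicates = int(toy.duplicated().sum())

# Keep the first occurrence of each row and renumber the index
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduped))
```

Rows that share the same text but differ in user or date are kept, which matches the full-row comparison used in this notebook.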

### Change the type of some columns

In [8]:
# Replace each user name with an integer category code
df['user_name'] = df['user_name'].astype('category').cat.codes

# Parse the timestamp strings and keep only the calendar date
df['date'] = pd.to_datetime(df['date']).dt.date
df.head()

Unnamed: 0,user_name,date,text
0,104885,2020-11-03,anxiety on x games mode 😳
1,23653,2020-11-03,@OfficialMonstaX\nOh i'm sorry did i make...
2,126001,2020-11-03,The way some of you dey see your body for this...
3,57974,2020-11-03,@ewjulii Count the ways Funtime Freddy doing t...
4,27889,2020-11-03,@MX_7Mention @OfficialMonstaX @official__wonho...


In [9]:
df.shape

(146046, 3)
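The type changes in this section can be illustrated on a toy frame. This is a sketch with made-up names and timestamps: category codes are integers assigned to the distinct values in sorted order, and `.dt.date` drops the time of day.

```python
import pandas as pd

# Made-up users and timestamps in the same string format as the tweet data
toy = pd.DataFrame({
    'user_name': ['zoe', 'adam', 'zoe'],
    'date': ['2020-11-03 23:59:59', '2020-11-04 00:10:00', '2020-11-09 23:56:33'],
})

# One integer per distinct name; the same name always gets the same code
toy['user_name'] = toy['user_name'].astype('category').cat.codes

# Parse the strings, then keep only the calendar date
toy['date'] = pd.to_datetime(toy['date']).dt.date

print(toy)
```

Note that the resulting `date` column holds Python `date` objects, so pandas reports its dtype as `object` rather than `datetime64`.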

## Data cleaning
Data cleaning steps on all text:
- Make text all lower case
- Remove URLs
- Remove punctuation
- Remove common nonsensical text (\n)
- Remove stopwords

Two further steps are planned but not applied in this pass: removing words that contain numbers, and removing emoji (the pattern below is commented out).

In [10]:
# create emoji pattern
#emoji_pattern = re.compile(pattern = "["
#        u"\U0001F600-\U0001F64F"  # emoticons
#        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
#        u"\U0001F680-\U0001F6FF"  # transport & map symbols
#        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
#        u"\U00002500-\U00002BEF"  # chinese char
#        u"\U00002702-\U000027B0"
#        u"\U00002702-\U000027B0"
#        u"\U000024C2-\U0001F251"
#        u"\U0001f926-\U0001f937"
#        u"\U00010000-\U0010ffff"
#        u"\u2640-\u2642"
#        u"\u2600-\u2B55"
#        u"\u200d"
#        u"\u23cf"
#        u"\u23e9"
#        u"\u231a"
#        u"\ufe0f"  # dingbats
#        u"\u3030"
#                           "]+", flags = re.UNICODE)

# Get the stopwords (NLTK's fileid is lowercase; run nltk.download('stopwords') once if missing)
stop_words = set(stopwords.words('english'))

def clean_text(text):
    """Lowercase a tweet and strip URLs, stopwords, punctuation and newlines."""
    text = re.sub(r'https?://\S+', '', text) # remove URLs (http and https)
    text = text.lower() # make text lowercase
    text = ' '.join([word for word in text.split() if word not in stop_words]) # remove stopwords
    text = text.translate(str.maketrans('', '', string.punctuation)) # remove punctuation
#    text = emoji_pattern.sub(r'', text) # remove emoji
    text = re.sub('\n', '', text) # remove common nonsensical text (\n)

    return text

clean = clean_text  # alias applied to the text column in the next cell

In [11]:
# Let's take a look at the updated text
df['text'] = df.text.apply(clean)
df.text

0                                    anxiety x games mode 😳
1                  officialmonstax oh im sorry make anxious
2           way dey see body app top deɛ asɛ mo nta mpo kai
3         ewjulii count ways funtime freddy whip nae nae...
4         mx7mention officialmonstax officialwonho love ...
                                ...                        
146041    imma honest literally know 3or 4 mufos would w...
146042    today really historic day history public healt...
146043    doorsausage profesterman balance though nsw ha...
146044    less 1week election dr anthony fauci says news...
146045    ill make official post tomorrow ab unf spree w...
Name: text, Length: 146046, dtype: object

Let's stop cleaning for now. We can come back to do more cleaning if the results don't make sense. 
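If more cleaning turns out to be needed, two candidates are removing emoji and removing words that contain digits. The sketch below is one possible way to do that, not part of the pipeline above: `extra_clean` is a hypothetical helper, and the emoji pattern is deliberately narrow (a few common Unicode blocks), so it will miss some symbols.

```python
import re

# Assumption: a narrow emoji pattern covering only a few common blocks
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols & pictographs
    "]+"
)

# Any run of word characters that contains at least one digit
DIGIT_WORD_PATTERN = re.compile(r"\w*\d\w*")


def extra_clean(text):
    """Hypothetical extra pass: drop emoji and words containing digits."""
    text = EMOJI_PATTERN.sub("", text)
    text = DIGIT_WORD_PATTERN.sub("", text)
    return " ".join(text.split())  # collapse the leftover whitespace


print(extra_clean("anxiety x games mode 😳"))
print(extra_clean("imma honest literally know 3or 4 mufos"))
```

Removing every token that contains a digit also drops handles such as `mx7mention`, which may or may not be desirable for the later analysis.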


### Export data to a new CSV file

In [12]:
# save the clean data for later use
df.to_csv('data/clean_data.csv')
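When the saved file is read back later, the row index and the date type are worth a thought. The cell above writes the index as an extra unnamed column (the `to_csv` default). The sketch below shows an alternative on made-up data: it writes without the index and parses the date column on reload. The temporary path is an assumption for illustration only.

```python
import os
import tempfile

import pandas as pd

# Made-up rows shaped like the cleaned data
toy = pd.DataFrame({
    'user_name': [1, 2],
    'date': ['2020-11-03', '2020-11-04'],
    'text': ['anxiety x games mode', 'oh im sorry make anxious'],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'clean_data.csv')
    toy.to_csv(path, index=False)  # index=False: do not write the row index
    restored = pd.read_csv(path, parse_dates=['date'])  # parse dates on load

print(restored.dtypes)
```

If a later notebook already reads the file with `index_col=0`, keeping the default index column is the simpler choice.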