# Political Social Media Analysis

In this project, I compare the tweets of Donald Trump, Barack Obama, and Hillary Clinton to look for meaningful insights.

Data:
There are 3 CSV files which will be used:
1. DonaldTrump
2. BarackObama
3. HillaryClinton

All 3 have the same structure:
sl no,date,id,link,retweet,text,author

Import libraries

In [1]:
import pandas as pd
import re

Read the data

In [2]:
trump = pd.read_csv("data/DonaldTrump.csv")
obama = pd.read_csv("data/BarackObama.csv")
clinton = pd.read_csv("data/HillaryClinton.csv")
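
The claim that all 3 files share one structure can be checked in code. The following is a minimal sketch, assuming the first column has no header in the files; the one-row `sample_csv` is a hypothetical stand-in for the real CSVs so the example runs on its own.

```python
import io
import pandas as pd

# Hypothetical one-row stand-in for the real CSV files; the first column has no header.
sample_csv = ",date,id,link,retweet,text,author\n0,Oct-07,1,/a/status/1,False,Hello,{name}\n"
frames = {
    name: pd.read_csv(io.StringIO(sample_csv.format(name=name)))
    for name in ["DonaldTrump", "BarackObama", "HillaryClinton"]
}

# Compare every frame's column list against the first one.
reference_cols = list(frames["DonaldTrump"].columns)
same_structure = all(list(df.columns) == reference_cols for df in frames.values())
print(reference_cols)
print("Same structure:", same_structure)
```

With the real data, the same comparison could be run on `trump`, `obama`, and `clinton` directly.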

### Check out the data

In [3]:
print("Number of tweets by Trump: ", trump.shape[0])
print("Number of tweets by Obama: ", obama.shape[0])
print("Number of tweets by Hillary: ", clinton.shape[0])

Number of tweets by Trump:  8439
Number of tweets by Obama:  6896
Number of tweets by Hillary:  3256


In [4]:
trump.head()

Unnamed: 0.1,Unnamed: 0,date,id,link,retweet,text,author
0,0,Oct-07,7.84609e+17,/realDonaldTrump/status/784609194234306560,False,Here is my statement.pic.twitter.com/WAZiGoQqMQ,DonaldTrump
1,1,Oct-10,7.85609e+17,/realDonaldTrump/status/785608815962099712,False,Is this really America? Terrible!pic.twitter.c...,DonaldTrump
2,2,Oct-08,7.84841e+17,/realDonaldTrump/status/784840992734064641,False,The media and establishment want me out of the...,DonaldTrump
3,3,Oct-11,7.85979e+17,/realDonaldTrump/status/785979396620324865,False,"Wow, @CNN Town Hall questions were given to Cr...",DonaldTrump
4,4,Oct-10,7.85561e+17,/realDonaldTrump/status/785561269571026946,False,Debate polls look great - thank you!\r\n#MAGA ...,DonaldTrump


In [5]:
obama.head()

Unnamed: 0.1,Unnamed: 0,date,id,link,retweet,text,author
0,0,Oct-13,7.86983e+17,/BarackObama/status/786982739517943808,False,Denying climate change is dangerous. Join @OFA...,BarackObama
1,1,Oct-13,7.8701e+17,/BarackObama/status/787010142378332160,False,The American Bar Association gave Judge Garlan...,BarackObama
2,2,Oct-13,7.8704e+17,/BarackObama/status/787039774330748928,False,We need a fully functional Supreme Court. Edit...,BarackObama
3,3,Oct-13,7.86964e+17,/BarackObama/status/786964419905523712,False,"Cynics, take note: When we #ActOnClimate, we b...",BarackObama
4,4,Oct-13,7.86681e+17,/BarackObama/status/786680553617629185,False,"""That’s how we will overcome the challenges we...",BarackObama


In [6]:
clinton.head()

Unnamed: 0.1,Unnamed: 0,date,id,link,retweet,text,author
0,0,Oct-09,7.85272e+17,/HillaryClinton/status/785272428905791489,False,Remember. #Debatepic.twitter.com/rlMbTt5WwY,HillaryClinton
1,1,Oct-09,7.85325e+17,/HillaryClinton/status/785325012152713216,False,She won. http://hrc.io/2dQkjip #Debatepic.twi...,HillaryClinton
2,2,Oct-09,7.85283e+17,/HillaryClinton/status/785282982261190656,False,Let's go. #Debatepic.twitter.com/HD3ZVJ9xl8,HillaryClinton
3,3,Oct-09,7.86964e+17,/HillaryClinton/status/786963642080227328,False,"""Everyone knows how bright she is and how resi...",HillaryClinton
4,4,Oct-09,7.86958e+17,/HillaryClinton/status/786958117531742208,False,"""All the progress we've made these last 8 year...",HillaryClinton


## Data Cleaning

A few issues were identified by eyeballing the data:
1. There is an unnamed column, which is unnecessary and must be dropped
2. The id and link columns are not required either
3. The date format is not standard
4. Since these are tweets, the text contains many non-alphanumeric characters

In [7]:
#Dropping the unnecessary columns
drop_cols = ["Unnamed: 0", "id", "link"]
trump.drop(drop_cols, axis=1, inplace=True)
obama.drop(drop_cols, axis=1, inplace=True)
clinton.drop(drop_cols, axis=1, inplace=True)
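
Because the drop is done in place, re-running that cell raises a `KeyError` once the columns are gone. A possible alternative, sketched here on a hypothetical miniature frame named `raw`, is to return a new frame and ignore columns that are already missing:

```python
import pandas as pd

# Hypothetical miniature frame with the same columns as the raw data.
raw = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "date": ["Oct-07", "Oct-10"],
    "id": [7.84609e17, 7.85609e17],
    "link": ["/a/status/1", "/a/status/2"],
    "retweet": [False, False],
    "text": ["Here is my statement.", "Terrible!"],
    "author": ["DonaldTrump", "DonaldTrump"],
})

drop_cols = ["Unnamed: 0", "id", "link"]
# errors="ignore" skips any listed column that is not present instead of raising.
cleaned = raw.drop(columns=drop_cols, errors="ignore")
print(list(cleaned.columns))
```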

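For our analysis, we do not need special characters like "%^&*()", so we keep only letters, digits, spaces, and the punctuation #@.,/?!  
We can also remove image references like "pic.twitter.com" and mentions of other accounts (words beginning with @).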

In [8]:
def replaceUnnecessaryText(text):
    text = text.split('pic.twitter')[0]   #We can also remove the references to an image
    text_list = text.split()
    text=""
    for word in text_list:
        word = re.sub(r'@\S+', '', word)     #Remove all words beginning with @ to remove references to people
        text+=" "+word
    return re.sub('[^a-zA-Z0-9#@.,/?! ]', '', text)


trump['text'] = trump['text'].apply(replaceUnnecessaryText)
obama['text'] = obama['text'].apply(replaceUnnecessaryText)
clinton['text'] = clinton['text'].apply(replaceUnnecessaryText)
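
To see the cleaning steps on a single string, here is a minimal sketch. `clean_tweet` is a hypothetical variant of `replaceUnnecessaryText`, not the function used in this notebook: it also drops the empty tokens left behind by removed mentions, so the result has no leading or doubled spaces from them. The sample tweet is made up for illustration.

```python
import re

def clean_tweet(text):
    """Hypothetical variant of replaceUnnecessaryText that drops empty tokens."""
    text = text.split("pic.twitter")[0]                      # cut off the image reference
    words = (re.sub(r"@\S+", "", w) for w in text.split())   # strip @mentions
    text = " ".join(w for w in words if w)                   # skip words that became empty
    return re.sub(r"[^a-zA-Z0-9#.,/?! ]", "", text)          # keep letters, digits, basic punctuation

sample = "Thanks @someone for a great town hall! #Debatepic.twitter.com/abc123"
print(clean_tweet(sample))
```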

Now we will work on standardising the date format.

Some dates include a year while others don't.
Those without a year are from 2016, so we first append 2016 to them.

In [9]:
def changeDateFormat(date):
    new_date = date
    if(len(date)<7):                        #Short strings such as "Oct-07" have no year attached yet
        if('-' in date):                    #Date parts can be separated by '-' or ' '
            date_arr = date.split('-')
        else:
            date_arr = date.split()
        new_date = date_arr[1]+" "+date_arr[0]+" 2016"   #Swap the two parts and add year 2016
    return new_date

In [10]:
trump.date = trump.date.apply(changeDateFormat)
obama.date = obama.date.apply(changeDateFormat)
clinton.date = clinton.date.apply(changeDateFormat)
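
As a worked example of this step, the sketch below restates the logic with the year as a parameter. `add_missing_year` is a hypothetical name, and the year-bearing input `"07 Oct 2015"` is an assumed format chosen only to show that longer strings pass through unchanged.

```python
def add_missing_year(date, year="2016"):
    """Hypothetical restatement of changeDateFormat with the year as a parameter."""
    if len(date) >= 7:                  # long enough to already include a year
        return date
    sep = "-" if "-" in date else None  # None makes split() use whitespace
    first, second = date.split(sep)
    return f"{second} {first} {year}"   # swap the two parts and append the year

print(add_missing_year("Oct-07"))
print(add_missing_year("07 Oct 2015"))
```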

Convert the string to a datetime format

In [11]:
trump['date'] = pd.to_datetime(trump['date'])
obama['date'] = pd.to_datetime(obama['date'])
clinton['date'] = pd.to_datetime(clinton['date'])
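
When parsing, it can help to find out whether any strings failed to convert. A minimal sketch, assuming the dates look like `"07 Oct 2016"` (the format string is an assumption and would need adjusting if the real year-bearing dates differ):

```python
import pandas as pd

# Hypothetical sample; the last entry is deliberately not a date.
sample = pd.Series(["07 Oct 2016", "10 Oct 2016", "not a date"])

# errors="coerce" turns unparseable strings into NaT instead of raising.
parsed = pd.to_datetime(sample, format="%d %b %Y", errors="coerce")
n_failed = parsed.isna().sum()
print(parsed)
print("Unparseable dates:", n_failed)
```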

All Trump and Clinton tweets are dated 01-01-2014 or later, while Obama's tweets start much earlier, in 2007.

Remove all pre-2014 Obama tweets

In [12]:
obama = obama[obama['date'].dt.year > 2013]
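
The year filter can be tried on a small frame first. This sketch uses a hypothetical `sample` frame; the `.copy()` and `reset_index(drop=True)` calls are optional additions, not part of the cell above.

```python
import pandas as pd

# Hypothetical sample spanning the 2014 cutoff.
sample = pd.DataFrame({
    "date": pd.to_datetime(["2014-01-02", "2013-12-31", "2012-06-15"]),
    "text": ["a", "b", "c"],
})

# Keep rows from 2014 onwards; .copy() avoids SettingWithCopyWarning on later edits,
# and reset_index(drop=True) renumbers the surviving rows from 0.
recent = sample[sample["date"].dt.year > 2013].copy().reset_index(drop=True)
print(len(recent), recent["date"].min())
```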

In [13]:
#By printing the tail, we can confirm that all pre-2014 tweets have been dropped
obama.tail()

Unnamed: 0,date,retweet,text,author
2120,2014-01-02,False,Luis can tell people hes going to the doctor ...,BarackObama
2121,2014-01-01,False,Rosetta from Missouri is covered. #ThisIsWhy ...,BarackObama
2122,2014-01-01,False,#ThisIsWhy Virginia can go to the doctor agai...,BarackObama
2123,2014-01-01,False,See stories of people who just got covered in...,BarackObama
2124,2014-01-01,False,This is why the Affordable Care Act matters. ...,BarackObama


Now let's compare the number of tweets in all 3 again

In [14]:
print("Number of tweets by Trump: ", trump.shape[0])
print("Number of tweets by Obama: ", obama.shape[0])
print("Number of tweets by Hillary: ", clinton.shape[0])

Number of tweets by Trump:  8439
Number of tweets by Obama:  2125
Number of tweets by Hillary:  3256


In [15]:
trump.head()

Unnamed: 0,date,retweet,text,author
0,2016-10-07,False,Here is my statement.,DonaldTrump
1,2016-10-10,False,Is this really America? Terrible!,DonaldTrump
2,2016-10-08,False,The media and establishment want me out of th...,DonaldTrump
3,2016-10-11,False,"Wow, Town Hall questions were given to Crook...",DonaldTrump
4,2016-10-10,False,Debate polls look great thank you! #MAGA #Am...,DonaldTrump


## Writing to CSV files

Use index=False so that the DataFrame index is not written to the CSV file as an extra unnamed first column.

In [16]:
trump.to_csv("data/DonaldTrumpClean.csv", index=False)
obama.to_csv("data/BarackObamaClean.csv", index=False)
clinton.to_csv("data/HillaryClintonClean.csv", index=False)
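
The effect of `index=False` can be seen with a round trip through an in-memory buffer. A small sketch on a hypothetical one-row frame:

```python
import io
import pandas as pd

df = pd.DataFrame({"date": ["2016-10-07"], "text": ["Here is my statement."]})

with_index = io.StringIO()
df.to_csv(with_index)                  # default index=True writes the row labels too
without_index = io.StringIO()
df.to_csv(without_index, index=False)  # row labels are left out

# Rewind both buffers and read them back to compare the column names.
with_index.seek(0)
without_index.seek(0)
cols_with = list(pd.read_csv(with_index).columns)
cols_without = list(pd.read_csv(without_index).columns)
print(cols_with)
print(cols_without)
```

Reading back the file written with the index produces an `Unnamed: 0` column, which is the same kind of column dropped during cleaning.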

## References

Data collected from: https://www.kaggle.com/nandys/social-media-analysis-kim-kardashian/data