# Data Cleaning 

We collected the data through the Twitter API: we queried a user ID and the number of tweets we wanted, and the API returned the tweets as tweet objects in JSON format. Each object holds the tweet text along with a good deal of supplementary metadata. To build our dataset, we picked the fields we wanted from each object, collected them in lists, and combined the lists into a dataframe. The fields we kept are: the ID number, the full text, entities data, date and time created, favourites, retweets, language, geolocation and who, if anyone, the tweet was in reply to.
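The collection code itself is not part of this notebook, so the following is only a minimal sketch of the list-based approach described above. The records are hand-written placeholders rather than real tweets, and the JSON key names (`favorite_count`, `retweet_count`, `lang`, `in_reply_to_screen_name`) are assumptions based on the usual tweet object fields; the output column names match the ones in our CSV files.

```python
import pandas as pd

# Hypothetical, hand-written records shaped like tweet objects (not real data).
raw_tweets = [
    {
        "id": 1,
        "full_text": "First example tweet",
        "entities": {"hashtags": [], "symbols": [], "user_mentions": []},
        "created_at": "2020-10-07 02:12:00",
        "favorite_count": 10,
        "retweet_count": 2,
        "lang": "en",
        "geo": None,
        "in_reply_to_screen_name": None,
    },
    {
        "id": 2,
        "full_text": "Second example tweet, a reply",
        "entities": {"hashtags": [], "symbols": [], "user_mentions": []},
        "created_at": "2020-10-07 01:30:00",
        "favorite_count": 5,
        "retweet_count": 1,
        "lang": "en",
        "geo": None,
        "in_reply_to_screen_name": "someone",
    },
]

# (output column name, assumed JSON key) pairs.
FIELDS = [
    ("id", "id"),
    ("full_text", "full_text"),
    ("entities", "entities"),
    ("created_at", "created_at"),
    ("favourite", "favorite_count"),
    ("retweets", "retweet_count"),
    ("language", "lang"),
    ("geo", "geo"),
    ("in_reply", "in_reply_to_screen_name"),
]

# One list per column, filled tweet by tweet.
columns = {name: [] for name, _ in FIELDS}
for tweet in raw_tweets:
    for name, key in FIELDS:
        columns[name].append(tweet.get(key))

# Combine the lists into a dataframe, one row per tweet.
tweets_df = pd.DataFrame(columns)
```

Using `tweet.get(key)` leaves a missing value in the row when a field is absent from a particular object, instead of raising an error.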

In [2]:
import pandas as pd
import numpy as np
from datetime import date

In [3]:
biden = pd.read_csv("biden_tweets.csv", index_col = 0)
trump = pd.read_csv("trump_tweets.csv", index_col = 0)

### Adding Columns for Analyses

In [4]:
# create date time column 
biden["date"] = pd.to_datetime(biden["created_at"], format = "%Y-%m-%d %H:%M:%S")
trump["date"] = pd.to_datetime(trump["created_at"], format = "%Y-%m-%d %H:%M:%S")
# create time of day column (each bucket is an inclusive range of hour values)
# morning   = 5 am to 11:59 am  = any hour value from 5 to 11
# afternoon = 12 pm to 4:59 pm  = any hour value from 12 to 16
# evening   = 5 pm to 10:59 pm  = any hour value from 17 to 22
# night     = 11 pm to 4:59 am  = any hour value from 23 to 4 (23,0,1,2,3,4) <- wraps around midnight
def time_of_day(x):
    hour = x.hour
    if 5 <= hour <= 11:
        return "morning"
    elif 12 <= hour <= 16:
        return "afternoon"
    elif 17 <= hour <= 22:
        return "evening"
    else:
        return "night"

# column for hour of day tweet posted 
def hour(x):
    return x.hour 

# create column for tweet word count  
def tweet_length (x):
    return len(x.split())

# map time_of_day to a binary flag: evening/night = 1, morning/afternoon = 0
day_night_map = {"morning":0, "afternoon":0, "evening":1, "night":1}


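# election day as a Timestamp, so that subtraction yields timedelta64 values
ELECTION_DAY = pd.Timestamp(date(2020, 11, 3))

def to_election_norm(x, dates):
    # dates: datetime64 Series of tweet timestamps; adds the column to x in place
    # normalize() drops the time of day, so we count whole calendar days
    x['days_to_election'] = (ELECTION_DAY - dates.dt.normalize()).dt.days

def to_election_group(x, dates):
    # dates: Series of datetime.date objects (as in the by-day dataframes)
    # pd.to_datetime converts them to datetime64 so the .dt accessor is available
    x['days_to_election'] = (ELECTION_DAY - pd.to_datetime(dates)).dt.days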
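As a cross-check on the hour buckets used for the `time_of_day` column, the same labels can be computed for a whole column at once. This is a sketch of an alternative, not the code we ran: the helper name `time_of_day_vectorized` and the sample timestamps are made up for illustration, and `np.select` stands in for the row-by-row function.

```python
import numpy as np
import pandas as pd

def time_of_day_vectorized(dates: pd.Series) -> pd.Series:
    """Label each timestamp using the same inclusive hour ranges as time_of_day."""
    hour = dates.dt.hour
    conditions = [
        hour.between(5, 11).to_numpy(),   # morning
        hour.between(12, 16).to_numpy(),  # afternoon
        hour.between(17, 22).to_numpy(),  # evening
    ]
    choices = ["morning", "afternoon", "evening"]
    # Everything not matched above (hours 23 and 0 to 4) falls through to "night".
    values = np.select(conditions, choices, default="night")
    return pd.Series(values, index=dates.index)

# Made-up timestamps sitting on the bucket boundaries.
sample = pd.Series(pd.to_datetime([
    "2020-10-07 04:59:00", "2020-10-07 05:00:00", "2020-10-07 11:59:00",
    "2020-10-07 12:00:00", "2020-10-07 16:59:00", "2020-10-07 17:00:00",
    "2020-10-07 22:59:00", "2020-10-07 23:00:00",
]))
labels = time_of_day_vectorized(sample)
```

Checking the boundary hours this way makes it easy to confirm that hour 22 is labelled evening and hour 23 is labelled night.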

### Applying functions (using `Series.map`)

In [5]:
# cell for applying functions and actually creating the columns 
biden['time_of_day'] = biden['date'].map(time_of_day)
biden['tweet_length'] = biden['full_text'].map(tweet_length) # word count of each tweet 
biden['hour'] = biden['date'].map(hour)
biden['day_or_night'] = biden['time_of_day'].map(day_night_map)

trump['time_of_day'] = trump['date'].map(time_of_day)
trump['tweet_length'] = trump['full_text'].map(tweet_length)
trump['hour'] = trump['date'].map(hour)
trump['day_or_night'] = trump['time_of_day'].map(day_night_map)

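# create per-day dataframes: one row per calendar date with the number of tweets posted that day
# (value_counts sorts by count, so row 0 is the busiest day, not the most recent one)
bidenbyday = (biden['date'].dt.date).value_counts().rename_axis('unique_dates').reset_index(name='tweets_each_day')
trumpbyday = (trump['date'].dt.date).value_counts().rename_axis('unique_dates').reset_index(name='tweets_each_day')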


to_election_norm(biden, biden['date'])
to_election_norm(trump, trump['date'])
# for bidenbyday and trumpbyday df 
to_election_group(bidenbyday, bidenbyday['unique_dates'])
to_election_group(trumpbyday, trumpbyday['unique_dates'])

#### The approach for counting Biden and Trump tweets by day (converting `value_counts` output to a dataframe) comes from the following source: https://stackoverflow.com/questions/47136436/python-pandas-convert-value-counts-output-to-dataframe
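To make the per-day counting concrete, here is a small sketch on made-up timestamps (the `toy` dataframe is hypothetical, not our data). It uses the same `value_counts` idiom and then derives the days remaining until the election on 3 November 2020.

```python
import pandas as pd

# Made-up timestamps: two tweets on 7 October 2020, one on 30 September 2020.
toy = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-10-07 02:12:00",
        "2020-10-07 01:30:00",
        "2020-09-30 23:59:00",
    ])
})

# One row per calendar date with the number of tweets posted on it.
by_day = (
    toy["date"].dt.date
    .value_counts()
    .rename_axis("unique_dates")
    .reset_index(name="tweets_each_day")
)

# Whole days between each date and election day.
election_day = pd.Timestamp(2020, 11, 3)
by_day["days_to_election"] = (
    election_day - pd.to_datetime(by_day["unique_dates"])
).dt.days
```

On this toy input, 7 October gets a count of 2 and is 27 days before the election, while 30 September gets a count of 1 and is 34 days before it.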

### Output to new CSV files

In [15]:
bidenbyday.to_csv("bidenbyday.csv")
trumpbyday.to_csv("trumpbyday.csv")
biden.to_csv("biden_updated.csv")
trump.to_csv("trump_updated.csv")
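One thing to keep in mind when these files are loaded again: CSV stores the `date` column as text, so it has to be parsed on the way back in. The sketch below illustrates this with a made-up two-row dataframe and an in-memory buffer instead of a file on disk; the same `parse_dates` argument would apply when reading `biden_updated.csv`.

```python
import io
import pandas as pd

# Made-up stand-in for the cleaned dataframe.
df = pd.DataFrame({
    "full_text": ["a b", "c"],
    "date": pd.to_datetime(["2020-10-07 02:12:00", "2020-10-07 01:30:00"]),
})

# Write to an in-memory buffer rather than a file.
buffer = io.StringIO()
df.to_csv(buffer)

# Reading back without parsing leaves the dates as plain strings.
buffer.seek(0)
plain = pd.read_csv(buffer, index_col=0)

# Asking pandas to parse the column restores the datetime dtype.
buffer.seek(0)
parsed = pd.read_csv(buffer, index_col=0, parse_dates=["date"])
```

After parsing, accessors such as `parsed["date"].dt.hour` work again, whereas on `plain` they would not.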

In [16]:
biden.head()

|  | id | full_text | entities | created_at | favourite | retweets | language | geo | in_reply | date | time_of_day | tweet_length | hour | day_or_night | days_to_election |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1313663334286323714 | Please. https://t.co/acHR1Rf8hE | {'hashtags': [], 'symbols': [], 'user_mentions... | 2020-10-07 02:12:00 | 90289 | 14867 | en |  |  | 2020-10-07 02:12:00 | night | 2 | 2 | 1 | 27 |
| 1 | 1313652765319557122 | I can’t wait for @KamalaHarris to make history... | {'hashtags': [], 'symbols': [], 'user_mentions... | 2020-10-07 01:30:00 | 13317 | 2126 | en |  |  | 2020-10-07 01:30:00 | night | 30 | 1 | 1 | 27 |
| 2 | 1313646473393840128 | Ignore the polls, folks. There’s too much at s... | {'hashtags': [], 'symbols': [], 'user_mentions... | 2020-10-07 01:05:00 | 56278 | 12104 | en |  |  | 2020-10-07 01:05:00 | night | 28 | 1 | 1 | 27 |
| 3 | 1313641691602202624 | Think about what it takes for a Black person t... | {'hashtags': [], 'symbols': [], 'user_mentions... | 2020-10-07 00:46:00 | 30592 | 6574 | en |  |  | 2020-10-07 00:46:00 | night | 30 | 0 | 1 | 27 |
| 4 | 1313633890289045504 | Look, folks, I'm going to be honest: we'll try... | {'hashtags': [], 'symbols': [], 'user_mentions... | 2020-10-07 00:15:00 | 13490 | 3023 | en |  |  | 2020-10-07 00:15:00 | night | 43 | 0 | 1 | 27 |


In [7]:
type(biden.date[0])

pandas._libs.tslibs.timestamps.Timestamp

In [8]:
biden['date']

0      2020-10-07 02:12:00
1      2020-10-07 01:30:00
2      2020-10-07 01:05:00
3      2020-10-07 00:46:00
4      2020-10-07 00:15:00
               ...        
3079   2019-10-16 03:24:14
3080   2019-10-16 03:21:44
3081   2019-10-16 03:13:06
3082   2019-10-16 02:47:51
3083   2019-10-16 02:35:02
Name: date, Length: 3084, dtype: datetime64[ns]

In [10]:
bidenbyday["unique_dates"][0]

datetime.date(2020, 9, 30)