### Data processing for Twitter sentiment analysis
#### Vivek Yadav, PhD

This notebook contains the preprocessing steps for the project I did for the Data visualization in D3 course of Udacity's Data Science Nano-Degree. I downloaded the data from [Kaggle](https://www.kaggle.com/crowdflower/twitter-airline-sentiment) and wrote code to extract latitude and longitude for tweets that carry a geographic location. I then obtained the list of the [30 busiest airports from Wikipedia](https://en.wikipedia.org/wiki/List_of_the_busiest_airports_in_the_United_States) and assigned each tweet with location data to its nearest airport, on the assumption that a tweet about an airline was sent from the airport closest to the tweet's location. Finally, I exported the data as CSV and performed the remaining analysis directly in JavaScript.

Link to the final project: [US Airline Twitter Sentiment Analysis](http://vxy10.github.io/p6_vis_versions/scroll_MapVersion2/index.html)

In [4]:
# Load libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


def isNaN(num):
    # NaN is the only value that is not equal to itself
    return num != num


#### Loading data

I first load the tweet data downloaded from Kaggle.

In [5]:
df = pd.read_csv('Tweets.csv')
df.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


#### Importing airport data

Here I load the locations of the 30 busiest airports, compiled from Wikipedia.

In [6]:
df_airport = pd.read_csv('data_airports.csv')
df_airport.head()

Unnamed: 0,airport_id,name,city,State,latitude,longitude
0,1,Hartsfield Jackson Atlanta Intl,Atlanta,GA,33.636719,-84.428067
1,2,Los Angeles Intl,Los Angeles,CA,33.942536,-118.408075
2,3,Chicago Ohare Intl,Chicago,IL,41.978603,-87.904842
3,4,Dallas Fort Worth Intl,Dallas-fort Worth,TX,32.896828,-97.037997
4,5,John F Kennedy Intl,New York,NY,40.639751,-73.778925


I then extract the latitude and longitude of each tweet from the `tweet_coord` column, filling in 0 where no location is available.

In [5]:
df["latitude"] = 0.0
df["longitude"] = 0.0
for i, coord in enumerate(df.tweet_coord):
    if not isNaN(coord):
        # drop the first and last character of the string, then split on the comma
        lat, lon = coord[1:-1].split(",")
        # .loc writes into df itself and avoids chained-assignment warnings
        df.loc[i, "latitude"] = float(lat)
        df.loc[i, "longitude"] = float(lon)

df.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone,latitude,longitude
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada),0,0
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada),0,0
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada),0,0
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada),0,0
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada),0,0
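
The same extraction can be done in one pass instead of assigning cell by cell. The following is a minimal sketch, assuming `tweet_coord` holds strings of the form `[lat, lon]`, as the slicing in the cell above implies. The helper name `parse_coord` and the sample frame are hypothetical. Unlike the cell above, it leaves missing coordinates as NaN rather than 0, so that "no location" cannot be confused with a real point at latitude 0.

```python
import numpy as np
import pandas as pd


def parse_coord(coord):
    """Return (lat, lon) floats from a '[lat, lon]' string, or NaNs if missing."""
    if pd.isna(coord):
        return (np.nan, np.nan)
    lat, lon = coord.strip("[]").split(",")
    return (float(lat), float(lon))


# Hypothetical sample standing in for the tweet_coord column of Tweets.csv
sample = pd.DataFrame({"tweet_coord": [np.nan, "[40.64, -73.78]", "[33.94, -118.41]"]})

# Build both columns at once, aligned on the original index
parsed = pd.DataFrame(
    [parse_coord(c) for c in sample["tweet_coord"]],
    index=sample.index,
    columns=["latitude", "longitude"],
)
sample = sample.join(parsed)
print(sample)
```

If the 0 placeholder is still wanted for later cells, `sample.fillna({"latitude": 0.0, "longitude": 0.0})` would restore it.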


I then compute the distance from each tweet's location to each of the 30 airports and assign the closest airport to every tweet that has location data.

In [8]:
df['airport_name'] = ''
df['airport_city'] = ''
df['airport_state'] = ''
df['airport_lat'] = 0.0
df['airport_lon'] = 0.0

for i in range(len(df)):
    # latitude == 0 marks tweets without coordinates (see the previous cell)
    if df.loc[i, 'latitude'] != 0:
        # squared Euclidean distance, in degrees, to every airport
        d_all = ((df_airport['latitude'] - df.loc[i, 'latitude'])**2
                 + (df_airport['longitude'] - df.loc[i, 'longitude'])**2)
        # idxmin returns the index label of the closest airport
        ind_min = d_all.idxmin()
        # .loc writes into df itself and avoids chained-assignment warnings
        df.loc[i, 'airport_name'] = df_airport.loc[ind_min, 'name']
        df.loc[i, 'airport_city'] = df_airport.loc[ind_min, 'city']
        df.loc[i, 'airport_state'] = df_airport.loc[ind_min, 'State']
        df.loc[i, 'airport_lat'] = df_airport.loc[ind_min, 'latitude']
        df.loc[i, 'airport_lon'] = df_airport.loc[ind_min, 'longitude']
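
The cell above ranks airports by squared differences in latitude and longitude, which treats a degree of longitude as the same length at every latitude. A great-circle distance is a more faithful measure when that matters. The following is a minimal sketch using the haversine formula. The function names `haversine_km` and `nearest_airport`, the Earth radius of 6371 km, and the test point are assumptions rather than part of the original notebook; the three airport rows are copied from the `df_airport` preview.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed value)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; lat2/lon2 may be scalars or Series."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def nearest_airport(lat, lon, airports):
    """Return the row of `airports` closest to the point (lat, lon)."""
    dist = haversine_km(lat, lon, airports["latitude"], airports["longitude"])
    return airports.loc[dist.idxmin()]


# Three rows taken from the df_airport preview
airports = pd.DataFrame({
    "name": ["Hartsfield Jackson Atlanta Intl", "Los Angeles Intl", "John F Kennedy Intl"],
    "latitude": [33.636719, 33.942536, 40.639751],
    "longitude": [-84.428067, -118.408075, -73.778925],
})

# Illustrative point in Manhattan; JFK is the closest of the three airports
print(nearest_airport(40.75, -73.99, airports)["name"])
```

For tweets sent close to an airport, both measures will usually select the same airport; they can differ for points roughly midway between two airports.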



I then selected the relevant columns from the Twitter data and exported them as CSV for plotting in D3.

In [11]:
df_small = df[['tweet_id',"airline_sentiment","airline","negativereason","latitude","longitude","user_timezone",
              'airport_name','airport_city','airport_state','airport_lat','airport_lon']]
df_small.to_csv('data_twitter.csv')
df_small

Unnamed: 0,tweet_id,airline_sentiment,airline,negativereason,latitude,longitude,user_timezone,airport_name,airport_city,airport_state,airport_lat,airport_lon
0,570306133677760513,neutral,Virgin America,,0.000000,0.000000,Eastern Time (US & Canada),,,,0.000000,0.000000
1,570301130888122368,positive,Virgin America,,0.000000,0.000000,Pacific Time (US & Canada),,,,0.000000,0.000000
2,570301083672813571,neutral,Virgin America,,0.000000,0.000000,Central Time (US & Canada),,,,0.000000,0.000000
3,570301031407624196,negative,Virgin America,Bad Flight,0.000000,0.000000,Pacific Time (US & Canada),,,,0.000000,0.000000
4,570300817074462722,negative,Virgin America,Can't Tell,0.000000,0.000000,Pacific Time (US & Canada),,,,0.000000,0.000000
5,570300767074181121,negative,Virgin America,Can't Tell,0.000000,0.000000,Pacific Time (US & Canada),,,,0.000000,0.000000
6,570300616901320704,positive,Virgin America,,0.000000,0.000000,Pacific Time (US & Canada),,,,0.000000,0.000000
7,570300248553349120,neutral,Virgin America,,0.000000,0.000000,Pacific Time (US & Canada),,,,0.000000,0.000000
8,570299953286942721,positive,Virgin America,,0.000000,0.000000,Pacific Time (US & Canada),,,,0.000000,0.000000
9,570295459631263746,positive,Virgin America,,0.000000,0.000000,Eastern Time (US & Canada),,,,0.000000,0.000000
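
Two details of the export matter when the CSV is read elsewhere: `to_csv` writes the DataFrame index as an extra unnamed column by default, and rows without coordinates still carry the placeholder 0 values. The following is a minimal sketch, assuming only located tweets are needed downstream. The sample frame and the in-memory buffer are hypothetical stand-ins for `df_small` and `data_twitter.csv`.

```python
import io

import pandas as pd

# Hypothetical stand-in for df_small: one unlocated and one located tweet
sample = pd.DataFrame({
    "tweet_id": [1, 2],
    "airline_sentiment": ["neutral", "negative"],
    "latitude": [0.0, 40.64],
    "longitude": [0.0, -73.78],
    "airport_name": ["", "John F Kennedy Intl"],
})

# Keep only rows where coordinates were found (0 is the notebook's placeholder)
located = sample[sample["latitude"] != 0]

# index=False omits the row index, so the file has no unnamed leading column
buffer = io.StringIO()
located.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing a file name instead of `buffer` writes the same content to disk.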
