# ML Pipeline Problem

This question will test some basic skills in cleaning data and building a machine learning pipeline.

The focus of this test is to evaluate:

 * Ability to quickly learn a new framework (luigi)
 * Ability to manipulate and process data (cleaning, processing, feature engineering)
 * Competency in software development

This test does not focus on modelling accuracy, ability to use a fancy model,
or efficiency.  It is mainly about the mechanics of building a proper machine
learning pipeline.

## Datasets

There are two files: `airline_tweets.csv` and `cities.csv`.

`airline_tweets.csv` has Twitter data regarding airline sentiment augmented
with some extra columns.  The relevant columns are:

* `airline_sentiment`: a string indicating if the tweet had positive,
  neutral or negative sentiment.
* `tweet_coord`: a string of the form "[<lat>, <long>]" if a
  geo-coordinate exists for that tweet, or an empty string otherwise.

`cities.csv` contains the latitude and longitude of large cities.
The relevant columns are:

* `name`: The name of the city.
* `latitude`: The latitude of the city.
* `longitude`: The longitude of the city.

## Problem

Build a basic ML pipeline using the `luigi` Python framework.  The pipeline
should clean the tweet data, prepare features for building a model, train a
classifier and score using the model.  The pipeline should have these steps:

 * `CleanDataTask`: Cleans the input tweet CSV file by removing any rows without valid geo-coordinates.
    * A coordinate is invalid if the `tweet_coord` column is empty or the coordinate is (0.0, 0.0).
 * `TrainingDataTask`: Extracts features/outcome variable in preparation for training a model.
    * This transforms the cleaned data into the exact form that the model can be fit on.
    * The "y" variable will be the multi-class sentiment (0, 1, 2 for negative, neutral and positive respectively).
    * The "X" variables will be the closest city to the "tweet_coord" using Euclidean distance.
    * You should use the `cities.csv` file to find the closest city.
    * You probably will need to one-hot encode the city names.
 * `TrainModelTask`: Trains a classifier to predict negative, neutral, positive based only on the input city.
    * Train a classifier that uses closest cities as features.
    * Dump the fitted model to the output file.
 * `ScoreTask`: Uses the trained model to compute the sentiment for each city.
    * Use the trained model to predict the probability/score of negative,
      neutral and positive sentiment for each city.
    * Output a sorted list of cities by the predicted positive sentiment score to the output file.

## Notes/Hints/Suggestions

 * We have provided a skeleton file named `pipeline.py` to get you started, and a
   script `run.sh` that will execute this luigi pipeline.
 * You must use the `luigi` package.
 * You must use Python (any version is fine).
 * Feel free to use any Python packages.  We used `pandas`, `scikit-learn`, `numpy`
   (as seen in the included requirements.txt).
 * Do not worry too much about run-time/memory efficiency.  So long as it runs
   within 15 minutes, it should be fine.

## References

 * Luigi package: `http://luigi.readthedocs.io/en/stable/`


In [204]:
"""
    Two caveats about this solution:
    1. 'name' is not unique in cities.csv; 'geonameid' would be a better key.
        I use 'name' because the assignment requires it.
    2. ScoreTask scores only the cities that are the closest city to at least
        one tweet, not every city in cities.csv.
"""

### python pipeline.py ScoreTask --local-scheduler --tweet-file airline_tweets.csv


In [1]:
import luigi
import pandas as pd
import numpy as np
import re
import time
from sklearn.linear_model import LogisticRegression
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23


In [2]:
cities_pd = pd.read_csv("cities.csv", encoding = "ISO-8859-1")
airline_tweets_pd = pd.read_csv("airline_tweets.csv", encoding = "ISO-8859-1")

In [3]:
cities_pd.head()

Unnamed: 0,geonameid,name,asciiname,alternatenames,latitude,longitude,feature class,feature code,country code,cc2,admin1 code,admin2 code,admin3 code,admin4 code,population,elevation,dem,timezone,modification date
0,3040051,les Escaldes,les Escaldes,"Ehskal'des-Ehndzhordani,Escaldes,Escaldes-Engo...",42.50729,1.53414,P,PPLA,AD,,8,,,,15853,,1033,Europe/Andorra,2008-10-15
1,3041563,Andorra la Vella,Andorra la Vella,"ALV,Ando-la-Vyey,Andora,Andora la Vela,Andora ...",42.50779,1.52109,P,PPLC,AD,,7,,,,20430,,1037,Europe/Andorra,2010-05-30
2,290594,Umm al Qaywayn,Umm al Qaywayn,"Oumm al Qaiwain,Oumm al QaÃ¯waÃ¯n,Um al Kawain...",25.56473,55.55517,P,PPLA,AE,,7,,,,44411,,2,Asia/Dubai,2014-10-07
3,291074,Ras al-Khaimah,Ras al-Khaimah,"Julfa,Khaimah,RKT,Ra's al Khaymah,Ra's al-Chai...",25.78953,55.9432,P,PPLA,AE,,5,,,,115949,,2,Asia/Dubai,2015-12-05
4,291696,Khawr FakkÄn,Khawr Fakkan,"Fakkan,FakkÄn,Khawr Fakkan,Khawr FakkÄn,Khaw...",25.33132,56.34199,P,PPL,AE,,6,,,,33575,,20,Asia/Dubai,2013-10-25


In [4]:
cities_pd.shape

(23278, 19)

In [5]:
airline_tweets_pd.head()

Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,airline_sentiment,airline_sentiment:confidence,negativereason,negativereason:confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_id,tweet_location,user_timezone
0,681448150,False,finalized,3,2/25/15 5:24,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2/24/15 11:35,5.70306e+17,,Eastern Time (US & Canada)
1,681448153,False,finalized,3,2/25/15 1:53,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2/24/15 11:15,5.70301e+17,,Pacific Time (US & Canada)
2,681448156,False,finalized,3,2/25/15 10:01,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2/24/15 11:15,5.70301e+17,Lets Play,Central Time (US & Canada)
3,681448158,False,finalized,3,2/25/15 3:05,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2/24/15 11:15,5.70301e+17,,Pacific Time (US & Canada)
4,681448159,False,finalized,3,2/25/15 5:50,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2/24/15 11:14,5.70301e+17,,Pacific Time (US & Canada)


In [6]:
np.unique(airline_tweets_pd['airline_sentiment'].values)

array(['negative', 'neutral', 'positive'], dtype=object)

In [7]:
airline_tweets_pd.columns

Index(['_unit_id', '_golden', '_unit_state', '_trusted_judgments',
       '_last_judgment_at', 'airline_sentiment',
       'airline_sentiment:confidence', 'negativereason',
       'negativereason:confidence', 'airline', 'airline_sentiment_gold',
       'name', 'negativereason_gold', 'retweet_count', 'text', 'tweet_coord',
       'tweet_created', 'tweet_id', 'tweet_location', 'user_timezone'],
      dtype='object')

In [8]:
airline_tweets_pd.shape #.tail(20)

(14640, 20)

In [9]:
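def clean_tweet_coord(df):
    # Drop rows with a missing coordinate (empty CSV cells are read as NaN).
    has_coord = df[~df["tweet_coord"].isnull()]
    # Drop rows whose coordinate contains the literal string "[0.0, 0.0]".
    return has_coord[~has_coord["tweet_coord"].str.contains('[0.0, 0.0]', regex=False)]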
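
The substring test above only catches the exact text `[0.0, 0.0]`. A possible alternative, sketched here under the assumption that every valid value looks like `[<lat>, <long>]`, is to parse the string and compare numbers. The helper names `parse_coord` and `clean_by_parsing` are hypothetical, and the toy rows are not taken from the tweet file.

```python
import pandas as pd


def parse_coord(raw):
    """Return (lat, lon) as floats, or None if missing, malformed, or (0, 0)."""
    if not isinstance(raw, str):
        return None  # NaN produced by an empty CSV cell
    parts = raw.strip().strip("[]").split(",")
    if len(parts) != 2:
        return None
    try:
        lat, lon = float(parts[0]), float(parts[1])
    except ValueError:
        return None
    if lat == 0.0 and lon == 0.0:
        return None
    return lat, lon


def clean_by_parsing(df):
    """Keep only rows whose `tweet_coord` parses to a valid coordinate."""
    mask = df["tweet_coord"].map(parse_coord).notnull()
    return df[mask]


# Toy rows: one valid coordinate followed by four kinds of invalid value.
tweets = pd.DataFrame(
    {"tweet_coord": ["[40.7, -73.9]", None, "[0.0, 0.0]", "[0, 0]", ""]}
)
cleaned = clean_by_parsing(tweets)
print(cleaned)
```

Parsing once also yields numeric latitude and longitude values, so the later regular-expression step would not be needed.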

In [10]:
clean_pd = clean_tweet_coord(airline_tweets_pd)

In [11]:
clean_pd.shape

(855, 20)

In [12]:
clean_pd

Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,airline_sentiment,airline_sentiment:confidence,negativereason,negativereason:confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_id,tweet_location,user_timezone
21,681448197,False,finalized,3,2/25/15 2:26,positive,1.0000,,,Virgin America,,DT_Les,,0,@VirginAmerica I love this graphic. http://t.c...,"[40.74804263, -73.99295302]",2/24/15 8:49,5.702640e+17,,
28,681448213,False,finalized,3,2/25/15 9:04,negative,1.0000,Bad Flight,1.0000,Virgin America,,blackjackpro911,,0,@VirginAmerica amazing to me that we can't get...,"[42.361016, -71.02000488]",2/24/15 5:05,5.702080e+17,"San Mateo, CA & Las Vegas, NV",
29,681448214,False,finalized,3,2/25/15 9:14,neutral,0.6150,,0.0000,Virgin America,,TenantsUpstairs,,0,@VirginAmerica LAX to EWR - Middle seat on a r...,"[33.94540417, -118.4062472]",2/23/15 23:34,5.701250e+17,Brooklyn,Atlantic Time (Canada)
32,681448223,False,finalized,3,2/25/15 1:57,negative,1.0000,Customer Service Issue,1.0000,Virgin America,,Cuschoolie1,,0,"@VirginAmerica help, left expensive headphones...","[33.94209449, -118.40410103]",2/23/15 21:10,5.700880e+17,Washington DC,Quito
34,681448228,False,finalized,3,2/25/15 1:01,positive,1.0000,,,Virgin America,,NorthTxHomeTeam,,0,@VirginAmerica this is great news! America co...,"[33.2145038, -96.9321504]",2/23/15 20:24,5.700770e+17,Texas,Central Time (US & Canada)
42,681448249,False,finalized,3,2/25/15 6:26,neutral,1.0000,,,Virgin America,,GottAmanda,,0,@VirginAmerica plz help me win my bid upgrade ...,"[34.0219817, -118.38591198]",2/23/15 16:24,5.700160e+17,Los Angeles,
62,681448292,False,finalized,3,2/25/15 8:08,neutral,0.6858,,,Virgin America,,adawson66,,0,@VirginAmerica @ladygaga @carrieunderwood all ...,"[33.57963333, -117.73024772]",2/23/15 14:30,5.699880e+17,,
69,681448308,False,finalized,3,2/25/15 3:22,negative,1.0000,Lost Luggage,1.0000,Virgin America,,gianagon,,0,@VirginAmerica everything was fine until you l...,"[40.6413712, -73.78311558]",2/23/15 13:08,5.699670e+17,New York + Panama,Eastern Time (US & Canada)
74,681448319,False,finalized,3,2/25/15 6:39,positive,1.0000,,,Virgin America,,mrmichaellay,,0,"@VirginAmerica not worried, it's been a great ...","[36.08457854, -115.13780136]",2/23/15 11:32,5.699430e+17,Floridian from Cincinnati,Eastern Time (US & Canada)
108,681448391,False,finalized,3,2/25/15 1:48,neutral,0.6593,,0.0000,Virgin America,,drcaseydrake,,0,@VirginAmerica I was scheduled for SFO 2 DAL f...,"[37.79374402, -122.39327564]",2/23/15 7:28,5.698820e+17,"Dallas, TX",


In [13]:
#clean_pd["tweet_coord"].str.findall(r"[-+]?\d*\.\d+|\d+")

In [14]:
clean_pd.to_csv('clean_tweets.csv', index=False, encoding = "ISO-8859-1")

In [15]:
#clean_pd["tweet_coord"]

In [16]:
clean_pd = pd.read_csv("clean_tweets.csv", encoding = "ISO-8859-1")

In [17]:
#change to array
clean_pd["tweet_coord"] = clean_pd["tweet_coord"].str.findall(r"[-+]?\d*\.\d+|\d+")
#clean_pd["tweet_coord"].apply(lambda x: re.findall(r"[-+]?\d*\.\d+|\d+", x))

In [18]:
float(clean_pd["tweet_coord"].values[0][0])

40.74804263

In [19]:
city_coord = cities_pd[['latitude', 'longitude']].values

In [20]:
city_coord.dtype

dtype('float64')

In [21]:
city_names = cities_pd['name'].values
city_names

array(['les Escaldes', 'Andorra la Vella', 'Umm al Qaywayn', ...,
       'Beitbridge', 'Epworth', 'Chitungwiza'], dtype=object)

## Calculate the closest city directly

In [22]:
start_time = time.time()
clean_pd['city_id'] = clean_pd["tweet_coord"].map(lambda x: np.argmin([(float(x[0])- coord[0])**2 + (float(x[1])-coord[1])**2 for coord in city_coord]))
print("---.%s seconds---" % (time.time()-start_time))

---.41.793593883514404 seconds---
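
Most of that time is spent in the Python-level loop over every city for every tweet. As a rough sketch, the same squared Euclidean distance can be computed with NumPy broadcasting; the function name `nearest_city_ids` and the toy coordinates below are illustrative assumptions, not values from `cities.csv`.

```python
import numpy as np


def nearest_city_ids(tweet_coords, city_coords):
    """Index of the closest city (squared Euclidean distance) for each tweet.

    tweet_coords has shape (n_tweets, 2); city_coords has shape (n_cities, 2).
    """
    tweet_coords = np.asarray(tweet_coords, dtype=float)
    city_coords = np.asarray(city_coords, dtype=float)
    # Broadcast to (n_tweets, n_cities, 2), then sum the squared differences.
    diff = tweet_coords[:, None, :] - city_coords[None, :, :]
    sq_dist = (diff ** 2).sum(axis=2)
    return sq_dist.argmin(axis=1)


# Toy data: three cities and two tweets.
cities = np.array([[40.71, -74.01], [33.92, -118.42], [33.15, -96.82]])
tweets = np.array([[40.748, -73.993], [33.945, -118.406]])
print(nearest_city_ids(tweets, cities))
```

The intermediate array holds `n_tweets * n_cities * 2` floats, which is roughly 300 MB for 855 tweets and 23278 cities, so larger inputs would need to be processed in chunks of tweets.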


In [23]:
clean_pd['city_id'].head()

0    21569
1    21062
2    22020
3    22020
4    20796
Name: city_id, dtype: int64

In [24]:
clean_pd['city_name'] = clean_pd['city_id'].map(lambda x: city_names[x])

In [25]:
clean_pd['city_name'].head()

0    New York City
1          Chelsea
2       El Segundo
3       El Segundo
4           Frisco
Name: city_name, dtype: object

## calculate the distance

In [26]:
start_time = time.time()
clean_pd['distance'] = clean_pd["tweet_coord"].map(lambda x: [(float(x[0])- coord[0])**2 + (float(x[1])-coord[1])**2 for coord in city_coord])
print("---.%s seconds---" % (time.time()-start_time))

---.39.911250829696655 seconds---


In [27]:
clean_pd.shape

(855, 23)

In [28]:
#clean_pd['distance'] = new_pd
clean_pd.tail()

Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,airline_sentiment,airline_sentiment:confidence,negativereason,negativereason:confidence,airline,...,retweet_count,text,tweet_coord,tweet_created,tweet_id,tweet_location,user_timezone,city_id,city_name,distance
850,681679736,False,finalized,3,2/25/15 19:01,negative,0.6694,Customer Service Issue,0.3451,American,...,0,@AmericanAir I have never on all my trips on a...,"[33.93939612, -118.38973148]",2/22/15 12:36,5.69597e+17,"Santa Monica, CA",Pacific Time (US & Canada),22020,El Segundo,"[14455.143756290496, 14452.02248169125, 30326...."
851,681679755,False,finalized,3,2/25/15 19:06,negative,1.0,Customer Service Issue,1.0,American,...,0,@AmericanAir we've been on hold for hours.,"[35.22534456, -106.57241352]",2/22/15 12:23,5.69593e+17,,,22321,Rio Rancho,"[11740.053643363759, 11737.239514814828, 26378..."
852,681679771,False,finalized,3,2/25/15 19:42,negative,1.0,Cancelled Flight,0.6333,American,...,0,@AmericanAir a friend is having flight Cancell...,"[40.46692522, -82.64567078]",2/22/15 12:16,5.69592e+17,Central Ohio,Eastern Time (US & Canada),21696,Mount Vernon,"[7090.403631392068, 7088.208749247991, 19321.5..."
853,681679783,False,finalized,3,2/25/15 19:06,negative,1.0,Can't Tell,1.0,American,...,0,"@AmericanAir Call me Chairman, or call me Emer...","[32.9070889, -97.03785947]",2/22/15 12:10,5.6959e+17,,,20803,Grapevine,"[9808.60294067412, 9806.039982241555, 23338.54..."
854,681679784,False,finalized,3,2/25/15 19:30,positive,1.0,,,American,...,0,@AmericanAir Flight 236 was great. Fantastic c...,"[40.64946781, -73.76624703]",2/22/15 12:08,5.6959e+17,East Coast,,21615,Springfield Gardens,"[5673.599790157447, 5671.636478430656, 16951.5..."


## find the nearest city

In [29]:
clean_pd['city_id'] = clean_pd['distance'].map(lambda x: np.argmin(x))
clean_pd['city_name'] = clean_pd['distance'].map(lambda x: city_names[np.argmin(x)])

In [30]:
#distance = clean_pd['distance'].values
#np.argmin(np.asarray(clean_pd.loc[21, 'distance']))
np.argmin(clean_pd.loc[21, 'distance'])

22187

In [31]:
#clean_pd[['city_id','city_name']]

In [32]:
#clean_pd.head()

In [33]:
np.unique(clean_pd['city_name'].values).shape

(291,)

## Check my calculation is right

In [34]:
cities_pd.loc[20796, ['latitude', 'longitude']]

latitude     33.1507
longitude   -96.8236
Name: 20796, dtype: object

In [35]:
(33.9192-33.94540417)**2 + (-118.416+118.4062472)**2

0.000781775633229281

## Get city name one hot

In [36]:
cityname_onehot = pd.get_dummies(cities_pd['name'])  # one column per distinct city name

In [37]:
city_oh_array = cityname_onehot.values

In [38]:
cityname_onehot.columns

Index([''Ali Sabieh', ''s-Gravenzande', ''s-Hertogenbosch', 'A CoruÃ±a',
       'A Estrada', 'Aabenraa', 'Aachen', 'Aalborg', 'Aalen', 'Aalsmeer',
       ...
       'âIzrÄ', 'âAÃ¯n Abid', 'âAÃ¯n Benian', 'âAÃ¯n Deheb',
       'âAÃ¯n Merane', 'âAÃ¯n el Bell', 'âAÃ¯n el Berd',
       'âAÃ¯n el Hammam', 'âAÃ¯n el Melh', 'âAÃ¯n el Turk'],
      dtype='object', length=22162)

In [39]:
city_oh_array.shape

(23278, 22162)

In [40]:
cities_pd.shape

(23278, 19)
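
The one-hot matrix has fewer columns (22162) than `cities_pd` has rows (23278) because `pd.get_dummies` creates one column per distinct value and some city names repeat. A toy illustration, with made-up `geonameid` values, of how the choice of key changes the encoding:

```python
import pandas as pd

# Toy stand-in for cities.csv: the name "Mercedes" appears twice.
cities = pd.DataFrame({
    "geonameid": [1, 2, 3],  # made-up ids, unique per row
    "name": ["Mercedes", "Frisco", "Mercedes"],
})

by_name = pd.get_dummies(cities["name"])
by_id = pd.get_dummies(cities["geonameid"])

print(by_name.shape)  # one column per distinct name
print(by_id.shape)    # one column per row, since the ids are unique
```

With names as the key, two different cities that share a name end up with the same feature vector, which is the first caveat noted at the top of this notebook.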

In [41]:
clean_pd['city_id'] = clean_pd['distance'].map(lambda x: np.argmin(x))
clean_pd['city_name_onehot'] = clean_pd['distance'].map(lambda x: city_oh_array[np.argmin(x)])
clean_pd['city_name'] = clean_pd['distance'].map(lambda x: city_names[np.argmin(x)])

## get airline_sentiment value

In [42]:
#as_onehot = pd.get_dummies(clean_pd['airline_sentiment'], columns=['airline_sentiment'])
sentiment = np.unique(clean_pd['airline_sentiment'].values).tolist()

In [43]:
clean_pd['y'] = clean_pd['airline_sentiment'].map(lambda x: sentiment.index(x))
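
The index-based encoding above gives 0, 1, 2 only because `np.unique` returns the three labels in alphabetical order; if one class were absent from the cleaned data, the remaining codes would shift. A small sketch of an explicit mapping that avoids that dependency (the constant name `SENTIMENT_TO_LABEL` and the toy rows are assumptions):

```python
import pandas as pd

# Encoding stated in the problem description: negative=0, neutral=1, positive=2.
SENTIMENT_TO_LABEL = {"negative": 0, "neutral": 1, "positive": 2}

toy = pd.DataFrame(
    {"airline_sentiment": ["positive", "negative", "neutral", "negative"]}
)
toy["y"] = toy["airline_sentiment"].map(SENTIMENT_TO_LABEL)
print(toy["y"].tolist())
```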


In [44]:
len(clean_pd.index)

855

## Build train_df

0    [0 0 0 ... 0 0 0]
1    [0 0 0 ... 0 0 0]
2    [0 0 0 ... 0 0 0]
3    [0 0 0 ... 0 0 0]
4    [0 0 0 ... 0 0 0]
Name: x_str, dtype: object

In [45]:
clean_pd.head()

Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,airline_sentiment,airline_sentiment:confidence,negativereason,negativereason:confidence,airline,...,tweet_coord,tweet_created,tweet_id,tweet_location,user_timezone,city_id,city_name,distance,city_name_onehot,y
0,681448197,False,finalized,3,2/25/15 2:26,positive,1.0,,,Virgin America,...,"[40.74804263, -73.99295302]",2/24/15 8:49,5.70264e+17,,,21569,New York City,"[5707.4367313605835, 5705.467404032634, 17013....","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",2
1,681448213,False,finalized,3,2/25/15 9:04,negative,1.0,Bad Flight,1.0,Virgin America,...,"[42.361016, -71.02000488]",2/24/15 5:05,5.70208e+17,"San Mateo, CA & Las Vegas, NV",,21062,Chelsea,"[5264.125335351106, 5262.231988996239, 16303.3...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0
2,681448214,False,finalized,3,2/25/15 9:14,neutral,0.615,,0.0,Virgin America,...,"[33.94540417, -118.4062472]",2/23/15 23:34,5.70125e+17,Brooklyn,Atlantic Time (Canada),22020,El Segundo,"[14459.002370651877, 14455.880658984288, 30332...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",1
3,681448223,False,finalized,3,2/25/15 1:57,negative,1.0,Customer Service Issue,1.0,Virgin America,...,"[33.94209449, -118.40410103]",2/23/15 21:10,5.70088e+17,Washington DC,Quito,22020,El Segundo,"[14458.544235494901, 14455.422583152029, 30332...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0
4,681448228,False,finalized,3,2/25/15 1:01,positive,1.0,,,Virgin America,...,"[33.2145038, -96.9321504]",2/23/15 20:24,5.70077e+17,Texas,Central Time (US & Canada),20796,Frisco,"[9781.966220496042, 9779.405713655302, 23310.9...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",2


In [164]:
features_df = pd.DataFrame(columns =['city_id', 'city_name', 'X', 'y', 'sentiment'])

In [165]:
features_df[['city_id', 'city_name', 'X', 'y', 'sentiment']] = clean_pd[['city_id', 'city_name', 'city_name_onehot', 'y', 'airline_sentiment']]

Unnamed: 0,city_id,city_name,X,y,sentiment
0,21569,New York City,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",2,positive
1,21062,Chelsea,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0,negative
2,22020,El Segundo,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",1,neutral
3,22020,El Segundo,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0,negative
4,20796,Frisco,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",2,positive


In [167]:
features_df["X"] = features_df["X"].map(lambda x: ' '.join([str(aa) for aa in x]))

In [168]:
features_df.head()

Unnamed: 0,city_id,city_name,X,y,sentiment
0,21569,New York City,0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...,2,positive
1,21062,Chelsea,0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...,0,negative
2,22020,El Segundo,0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...,1,neutral
3,22020,El Segundo,0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...,0,negative
4,20796,Frisco,0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...,2,positive


In [169]:
features_df.to_csv('features.csv', index=False, encoding = "ISO-8859-1")

In [131]:
a=[0,0,0,0,0,0]
b = " ".join(str(aa) for aa in a)
print(b)

0 0 0 0 0 0


In [139]:
aaa = ' '.join(["e", 'f', 'k'])
aaa

'e f k'

In [170]:
train_df = pd.read_csv("features.csv", encoding = "utf-8")

In [163]:
#train_df["X"]

In [171]:
train_df["X"] = train_df["X"].map(lambda x: x.split(" "))

In [172]:
train_df["X"].head()

0    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
3    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
4    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
Name: X, dtype: object

## Train the model

In [50]:
#clean_pd[['city_name_onehot', 'y']]

In [173]:
X = train_df["X"].values.tolist() #clean_pd['city_name_onehot'].values.tolist() #train_df["X"].values.tolist()
y = train_df["y"].values.tolist()

In [174]:
X = np.asarray(X).astype(float)

In [175]:
X

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [103]:
#x = np.array([['1.1', '2.2', '3.3']])
#z = x.astype(np.float)
#z

array([[1.1, 2.2, 3.3]])

In [54]:
#y

In [176]:
clf = LogisticRegression(random_state=0, solver='lbfgs',
                         multi_class='multinomial').fit(X, y)

In [162]:
clean_pd.head()

Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,airline_sentiment,airline_sentiment:confidence,negativereason,negativereason:confidence,airline,...,tweet_created,tweet_id,tweet_location,user_timezone,city_id,city_name,distance,city_name_onehot,y,x_str
0,681448197,False,finalized,3,2/25/15 2:26,positive,1.0,,,Virgin America,...,2/24/15 8:49,5.70264e+17,,,21569,New York City,"[5707.4367313605835, 5705.467404032634, 17013....","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",2,[0 0 0 ... 0 0 0]
1,681448213,False,finalized,3,2/25/15 9:04,negative,1.0,Bad Flight,1.0,Virgin America,...,2/24/15 5:05,5.70208e+17,"San Mateo, CA & Las Vegas, NV",,21062,Chelsea,"[5264.125335351106, 5262.231988996239, 16303.3...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0,[0 0 0 ... 0 0 0]
2,681448214,False,finalized,3,2/25/15 9:14,neutral,0.615,,0.0,Virgin America,...,2/23/15 23:34,5.70125e+17,Brooklyn,Atlantic Time (Canada),22020,El Segundo,"[14459.002370651877, 14455.880658984288, 30332...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",1,[0 0 0 ... 0 0 0]
3,681448223,False,finalized,3,2/25/15 1:57,negative,1.0,Customer Service Issue,1.0,Virgin America,...,2/23/15 21:10,5.70088e+17,Washington DC,Quito,22020,El Segundo,"[14458.544235494901, 14455.422583152029, 30332...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0,[0 0 0 ... 0 0 0]
4,681448228,False,finalized,3,2/25/15 1:01,positive,1.0,,,Virgin America,...,2/23/15 20:24,5.70077e+17,Texas,Central Time (US & Canada),20796,Frisco,"[9781.966220496042, 9779.405713655302, 23310.9...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",2,[0 0 0 ... 0 0 0]


In [177]:
clf.predict_proba(X[5:20, :])

array([[0.45511089, 0.36620047, 0.17868863],
       [0.45511089, 0.36620047, 0.17868863],
       [0.73749134, 0.16688061, 0.09562805],
       [0.67601554, 0.07088513, 0.25309934],
       [0.45729415, 0.30017348, 0.24253236],
       [0.6268758 , 0.14585264, 0.22727156],
       [0.63050772, 0.1219309 , 0.24756138],
       [0.72622313, 0.13118368, 0.14259318],
       [0.73749134, 0.16688061, 0.09562805],
       [0.63050772, 0.1219309 , 0.24756138],
       [0.81383511, 0.09116459, 0.0950003 ],
       [0.68285654, 0.08301075, 0.23413271],
       [0.68285654, 0.08301075, 0.23413271],
       [0.6268758 , 0.14585264, 0.22727156],
       [0.68285654, 0.08301075, 0.23413271]])

In [178]:
# Output a pickle file for the model
joblib.dump(clf, 'saved_model.pkl') 

['saved_model.pkl']

In [179]:
clf_load = joblib.load('saved_model.pkl') 

In [180]:
y_pred = clf_load.predict(X)

In [181]:
print(np.sum(y_pred==y)/len(y_pred))

0.6970760233918128


In [182]:
y_pred_prob = clf_load.predict_proba(X)

In [183]:
y_pred_prob

array([[0.48150977, 0.19796215, 0.32052807],
       [0.6083696 , 0.19267507, 0.19895533],
       [0.63050772, 0.1219309 , 0.24756138],
       ...,
       [0.77536881, 0.1079688 , 0.1166624 ],
       [0.80175037, 0.06938709, 0.12886254],
       [0.73749134, 0.16688061, 0.09562805]])

## Create output

In [185]:
sentiments = np.unique(train_df['sentiment'].values).tolist()
sentiments

['negative', 'neutral', 'positive']

In [187]:
pred_prob_df = pd.DataFrame(y_pred_prob, columns = sentiments)

Unnamed: 0,negative,neutral,positive,city_name
0,0.48151,0.197962,0.320528,New York City
1,0.60837,0.192675,0.198955,Chelsea
2,0.630508,0.121931,0.247561,El Segundo
3,0.630508,0.121931,0.247561,El Segundo
4,0.457919,0.124169,0.417912,Frisco


In [190]:
pred_prob_df["city_name"] = clean_pd["city_name"]

In [199]:
pred_prob_df = pred_prob_df.drop_duplicates()

In [200]:
print(pred_prob_df.shape)
pred_prob_df.head()

(291, 4)


Unnamed: 0,negative,neutral,positive,city_name
549,0.274621,0.118725,0.606654,The Valley
402,0.274621,0.118725,0.606654,Meadow Woods
748,0.340416,0.138151,0.521433,Costa Mesa
322,0.340416,0.138151,0.521433,Salt Lake City
110,0.302891,0.256372,0.440736,Bergeijk


In [201]:
np.unique(pred_prob_df["city_name"]).shape

(291,)

In [202]:
pred_prob_df=pred_prob_df.sort_values('positive', ascending=False)

In [203]:
pred_prob_df.head(100)

Unnamed: 0,negative,neutral,positive,city_name
549,0.274621,0.118725,0.606654,The Valley
402,0.274621,0.118725,0.606654,Meadow Woods
748,0.340416,0.138151,0.521433,Costa Mesa
322,0.340416,0.138151,0.521433,Salt Lake City
110,0.302891,0.256372,0.440736,Bergeijk
4,0.457919,0.124169,0.417912,Frisco
705,0.457919,0.124169,0.417912,White Plains
324,0.449932,0.162464,0.387604,Lubbock
203,0.449932,0.162464,0.387604,Etobicoke
245,0.449932,0.162464,0.387604,Farragut
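
The table above is obtained by predicting for every tweet and then dropping duplicate rows. Another way to get one row per city, sketched here on toy data rather than the tweet data, is to feed the model one one-hot row per city, which is an identity matrix in the same column order as the training features. The city names `A`, `B`, `C` and the default `LogisticRegression` settings are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy training data: nearest city per tweet and its sentiment label.
train = pd.DataFrame({
    "city_name": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "y": [2, 2, 0, 0, 0, 1, 1, 1, 2],
})

X = pd.get_dummies(train["city_name"]).astype(float)
clf = LogisticRegression().fit(X.values, train["y"].values)

# One one-hot row per city: the identity matrix in the column order of X.
city_names = X.columns.tolist()
city_rows = np.eye(len(city_names))

# predict_proba columns follow clf.classes_, which is [0, 1, 2] here.
scores = pd.DataFrame(
    clf.predict_proba(city_rows),
    columns=["negative", "neutral", "positive"],
)
scores.insert(0, "city_name", city_names)
scores = scores.sort_values("positive", ascending=False).reset_index(drop=True)
print(scores)
```

This keeps the scoring step independent of how many tweets map to each city.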


## Old code

In [67]:
clean_pd[sentiments] = pred_prob_df[sentiments]

In [68]:
#pred_prob_df[['name']] = clean_pd[['name']]
clean_pd.shape

(855, 28)

In [69]:
#pred_prob_df.head(100)
o_df = clean_pd[['negative','neutral','positive' ,'city_name']]

In [70]:
o_df=o_df.sort_values('positive', ascending=False)
o_df.shape

(855, 4)

In [71]:
xy = o_df.values

In [72]:
xy

array([[0.4815097739548978, 0.19796215120228494, 0.3205280748428173,
        'Washington, D.C.'],
       [0.570044336053005, 0.13544731434386229, 0.2945083496031328,
        'Kenner'],
       [0.6089094378451265, 0.10623700149381524, 0.28485356066105827,
        'Calgary'],
       ...,
       [nan, nan, nan, 'Mount Vernon'],
       [nan, nan, nan, 'Grapevine'],
       [nan, nan, nan, 'Springfield Gardens']], dtype=object)

In [73]:
clean_pd.tail()

Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,airline_sentiment,airline_sentiment:confidence,negativereason,negativereason:confidence,airline,...,tweet_location,user_timezone,distance,city_id,city_name,city_name_onehot,y,negative,neutral,positive
14577,681679736,False,finalized,3,2/25/15 19:01,negative,0.6694,Customer Service Issue,0.3451,American,...,"Santa Monica, CA",Pacific Time (US & Canada),"[14455.143756290496, 14452.02248169125, 30326....",22020,El Segundo,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0,,,
14596,681679755,False,finalized,3,2/25/15 19:06,negative,1.0,Customer Service Issue,1.0,American,...,,,"[11740.053643363759, 11737.239514814828, 26378...",22321,Rio Rancho,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0,,,
14612,681679771,False,finalized,3,2/25/15 19:42,negative,1.0,Cancelled Flight,0.6333,American,...,Central Ohio,Eastern Time (US & Canada),"[7090.403631392068, 7088.208749247991, 19321.5...",21696,Mount Vernon,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0,,,
14624,681679783,False,finalized,3,2/25/15 19:06,negative,1.0,Can't Tell,1.0,American,...,,,"[9808.60294067412, 9806.039982241555, 23338.54...",20803,Grapevine,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0,,,
14625,681679784,False,finalized,3,2/25/15 19:30,positive,1.0,,,American,...,East Coast,,"[5673.599790157447, 5671.636478430656, 16951.5...",21615,Springfield Gardens,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",2,,,


## Old temp codes

In [74]:
#clean_pd['name']

In [75]:
from sklearn.datasets import load_iris

X1, y1 = load_iris(return_X_y=True)

In [76]:
X1

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3

In [77]:
y1

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [78]:
at_pd = airline_tweets_pd[airline_tweets_pd["tweet_coord"].str.contains('[[0-9]+.[0-9]+, [0-9]+.[0-9]+]')]

ValueError: cannot index with vector containing NA / NaN values

In [None]:
#airline_tweets_pd[airline_tweets_pd["tweet_coord"].isnull()].size
#airline_tweets_pd["tweet_coord"] #.isnull()
at_pd_temp = airline_tweets_pd[~airline_tweets_pd["tweet_coord"].str.contains('nan', case = False)]

In [None]:
at_pd_temp.size

In [None]:
at_pd_temp["tweet_coord"].head()

In [None]:
at_pd_temp1 = at_pd_temp[~at_pd_temp["tweet_coord"].str.contains('[0.0, 0.0]', regex = False)]
at_pd_temp1.size

In [None]:
at_pd_temp2 = at_pd_temp1[at_pd_temp1["tweet_coord"].str.contains('0.0]', regex = False)]
at_pd_temp2.size

In [None]:
at_pd_temp1["tweet_coord"].head()

In [None]:
print(airline_tweets_pd.size)
print(at_pd.size)
at_pd["tweet_coord"].head()

In [None]:
at_pd_new = at_pd[~at_pd["tweet_coord"].str.contains('[0.0, 0.0]')]

In [None]:
at_pd_new["tweet_coord"].head()

In [None]:
#airline_tweets_pd.dtypes
#airline_tweets_pd = airline_tweets_pd.astype({"tweet_coord":str}) # = airline_tweets_pd["tweet_coord"].apply(pd.to)
airline_tweets_pd["tweet_coord"] = airline_tweets_pd["tweet_coord"].astype('str')

In [None]:
airline_tweets_pd["tweet_coord"].dtypes

In [None]:
at_pd.head()

In [None]:
#geonameid
print(len(cities_pd['name'].values))
print(np.unique(cities_pd['name'].values).shape)

print(len(cities_pd['geonameid'].values))
print(np.unique(cities_pd['geonameid'].values).shape)

In [None]:
d_df = cities_pd[cities_pd['name'].duplicated()]

In [None]:
d_df[d_df['name']=='Mercedes']  #.head(20)