# Data Management: Merge Weather Data

## This notebook:
1. Merges weather data of NYC in 2016 with ('merged2_event_CB2016_800m.csv')

### Data Source:
https://www.kaggle.com/mathijs/weather-data-in-new-york-city-2016/data

In [1]:
import pandas as pd
import statsmodels.formula.api as smf

# Data Cleaning

In [2]:
weather = pd.read_csv('weather_2016.csv')
weather.head()

Unnamed: 0,date,maximum temperature,minimum temperature,average temperature,precipitation,snow fall,snow depth
0,1-1-2016,42,34,38.0,0.0,0.0,0
1,2-1-2016,40,32,36.0,0.0,0.0,0
2,3-1-2016,45,35,40.0,0.0,0.0,0
3,4-1-2016,36,14,25.0,0.0,0.0,0
4,5-1-2016,29,11,20.0,0.0,0.0,0


In [3]:
w = weather[['date','average temperature','precipitation']]
w.head()

Unnamed: 0,date,average temperature,precipitation
0,1-1-2016,38.0,0.0
1,2-1-2016,36.0,0.0
2,3-1-2016,40.0,0.0
3,4-1-2016,25.0,0.0
4,5-1-2016,20.0,0.0


In [4]:
data = pd.read_csv('merged2_event_CB2016_800m.csv')
data.head()

Unnamed: 0.1,Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,...,usertype,birth year,gender,startdate,stopdate,Event_type,End_Time,weekday,O_date,O_hour
0,0,173,16:03:10,16:06:03,243,Fulton St & Rockwell Pl,40.688226,-73.979382,241,DeKalb Ave & S Portland Ave,...,Subscriber,1971.0,2,2016-01-01,2016-01-01,no-event,,,2016-01-01,16
1,1,136,16:05:54,16:08:11,420,Clermont Ave & Lafayette Ave,40.687645,-73.969689,270,Adelphi St & Myrtle Ave,...,Subscriber,1980.0,1,2016-01-01,2016-01-01,no-event,,,2016-01-01,16
2,2,653,16:13:47,16:24:40,83,Atlantic Ave & Fort Greene Pl,40.683826,-73.976323,278,Concord St & Bridge St,...,Subscriber,1976.0,1,2016-01-01,2016-01-01,no-event,,,2016-01-01,16
3,3,659,16:13:47,16:24:46,83,Atlantic Ave & Fort Greene Pl,40.683826,-73.976323,278,Concord St & Bridge St,...,Subscriber,1985.0,2,2016-01-01,2016-01-01,no-event,,,2016-01-01,16
4,4,1419,16:20:39,16:44:19,83,Atlantic Ave & Fort Greene Pl,40.683826,-73.976323,532,S 5 Pl & S 4 St,...,Subscriber,1993.0,1,2016-01-01,2016-01-01,no-event,,,2016-01-01,16


In [5]:
d = pd.DataFrame(data.groupby(['O_date','Event_type','O_hour'],as_index=False).size())
d.reset_index(inplace=True)
d.columns = ['O_date','Event_type','O_hour','Count']
d.head()

Unnamed: 0,O_date,Event_type,O_hour,Count
0,2016-01-01,no-event,16,8
1,2016-01-01,no-event,17,7
2,2016-01-01,no-event,18,4
3,2016-01-01,no-event,19,2
4,2016-01-01,no-event,20,2


# Data Merging

In [6]:
w['date'] = pd.to_datetime(w.date)
d['O_date'] = pd.to_datetime(d.O_date)
d = d.merge(w,left_on='O_date',right_on='date',how='left')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [7]:
d.rename(columns={'average temperature':'temperature'},inplace=True)

# Data Engineering

In [8]:
# modify precipitation data type
d['precipitation'] = pd.to_numeric(d['precipitation'],errors='coerce')
d['precipitation'][d['precipitation'].isnull()] = 0
d.describe()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until


Unnamed: 0,O_hour,Count,temperature,precipitation
count,2879.0,2879.0,2879.0,2879.0
mean,19.489406,20.891976,57.61966,0.109851
std,2.288399,17.545349,16.906403,0.289087
min,16.0,1.0,7.0,0.0
25%,17.0,9.0,44.5,0.0
50%,19.0,16.0,56.5,0.0
75%,21.0,27.0,74.0,0.04
max,23.0,109.0,88.5,2.2


In [9]:
# generate weekday, month, season
d['weekday'] = [i not in [5,6] for i in d['O_date'].dt.weekday.values]
d['O_month'] = pd.to_datetime(d['O_date']).dt.month
d['season'] = d['O_month'].map({1: 1,
                              2: 1,
                              3: 2,
                              4: 2,
                              5: 2,
                              6: 3,
                              7: 3,
                              8: 3,
                              9: 4,
                              10: 4,
                              11: 4,
                              12: 1})
d.head()

Unnamed: 0,O_date,Event_type,O_hour,Count,date,temperature,precipitation,weekday,O_month,season
0,2016-01-01,no-event,16,8,2016-01-01,38.0,0.0,True,1,1
1,2016-01-01,no-event,17,7,2016-01-01,38.0,0.0,True,1,1
2,2016-01-01,no-event,18,4,2016-01-01,38.0,0.0,True,1,1
3,2016-01-01,no-event,19,2,2016-01-01,38.0,0.0,True,1,1
4,2016-01-01,no-event,20,2,2016-01-01,38.0,0.0,True,1,1


In [10]:
d.to_csv('merged3_final.csv')