# Data manipulation, transformation and merging

A startup called [Yellow](https://www.yellow.app/?gclid=Cj0KCQjwlqLdBRCKARIsAPxTGaXuc2q-r_SnOuWBHji4ZmSEQ6gqHbGbSFgyyfGwqAJy07vK2la1VbkaAstTEALw_wc) has just started operating in São Paulo. The company lets its users ride bikes spread all over the city. All you need to do is download the app, add some credit and start riding.

![yellow bike](https://portalbr.akamaized.net/brasil/uploads/2018/08/03141321/yellow-capa.jpg 'One of the bikes you can use in São Paulo')

These little yellow bikes are multiplying fast, and people are using them a lot! With this in mind, I started to wonder whether there was any data on services like this. Thanks to Kaggle, the answer is YES: there is plenty of bike-sharing data out there!

After some quick research, I decided to challenge myself to find answers about this business model. Has it succeeded in other places? Who uses it? Why, and when?

To answer these questions, I started my exploration with bike-sharing data from Seattle, USA. The data is divided into three sets, as described below:

* trip dataset: information about the trips, such as trip duration, user gender, user type, trip date and more;
* station dataset: the location of the bike stations spread across Seattle;
* weather dataset: data about the weather.

The data goes back to 2014.

To do a good EDA on this data, I think we should join the most relevant data into a single dataset. That's my challenge!

So I hope you come with me on this journey, along with my fellow pandas and other libraries that will help us along the way!

I hope you enjoy the trip. If you like what you see, please leave a comment; an upvote would be more than welcome!

Let's do this!

In [None]:
# importing the libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
from dateutil.parser import parse # Helps to parse strings into dates

%matplotlib inline

In [None]:
# importing each dataset

# on_bad_lines='skip' (pandas 1.3+) skips malformed rows instead of raising an error
trip = pd.read_csv('../input/trip.csv', on_bad_lines='skip')


station = pd.read_csv('../input/station.csv', on_bad_lines='skip')


weather = pd.read_csv('../input/weather.csv', on_bad_lines='skip')
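A note on skipping bad rows: the `error_bad_lines` argument was deprecated in pandas 1.3 and removed in 2.0, and `on_bad_lines='skip'` is its replacement. Here is a minimal sketch on an in-memory CSV. The sample rows are made up for illustration and are not taken from the real files.

```python
import io
import pandas as pd

# Hypothetical sample: the third data row has an extra field, so it is malformed
raw = io.StringIO(
    "trip_id,tripduration\n"
    "1,985.9\n"
    "2,926.4\n"
    "3,883.1,unexpected\n"
    "4,865.9\n"
)

# on_bad_lines='skip' drops rows with too many fields instead of raising an error
sample = pd.read_csv(raw, on_bad_lines='skip')
print(len(sample))
```

Skipped rows disappear silently, so it is worth comparing the row count with the raw file when the numbers matter.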

In [None]:
trip.head(10)

In [None]:
trip.info()

In [None]:
station.head()

In [None]:
station.info()

In [None]:
weather.head()

In [None]:
weather.info()

**Let's explore these datasets. I will start with the trips dataset, but before I start the exploration, there are some things I should do first:**

* Transform the time columns into datetime values (I'm challenging myself here; this is the first time I do this)
* Join the latitude and longitude to the trip dataset, so I can do some geographic analysis

# Data Manipulation on Trips Dataset

In [None]:
trip.columns

In [None]:
# let's drop the id columns that are unnecessary

trip.drop(['trip_id','bikeid'], axis = 1,   inplace = True) # Drops the columns passed as argument; axis=1 applies the drop
# to columns, and inplace=True saves the change in the dataset

In [None]:
trip.info()

In [None]:
# Transforming the starttime column

data = trip['starttime']

In [None]:
dta = list(trip['starttime']) # copies the values of the starttime column (still strings) into a list
dta = pd.to_datetime(dta)  # pandas parses each string in the list into a datetime
trip['starttime'] = dta # Saving column changes
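The intermediate list is not strictly needed, since `pd.to_datetime` also accepts a Series. The sketch below assumes the timestamps look like `10/13/2014 10:31`; the sample values and the `format` string are my assumptions, so check them against the real column first.

```python
import pandas as pd

# Hypothetical sample in month/day/year style; the last value is deliberately invalid
sample = pd.DataFrame({'starttime': ['10/13/2014 10:31', '10/13/2014 10:32', 'not a date']})

# errors='coerce' turns unparseable values into NaT instead of raising an error
sample['starttime'] = pd.to_datetime(sample['starttime'], format='%m/%d/%Y %H:%M', errors='coerce')

print(sample.dtypes)
print(sample['starttime'].isnull().sum())
```

Passing an explicit `format` usually speeds up parsing on large columns and makes the day/month order unambiguous.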

In [None]:
# Checking if everything went right
trip.info()

In [None]:
trip.starttime

In [None]:
trip.head()

In [None]:
# in fact we just need the starttime column, since we already have the trip duration

trip.drop(columns='stoptime', inplace=True)

In [None]:
trip.info()

### Now, it would be a good idea to find out the age of the users, since we have their year of birth.

In [None]:
trip.isnull().sum()

In [None]:
trip.birthyear.describe()

In [None]:
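# Filling in the missing values with random years between 1969 and 1988 (the range in which most of the data is).
# One year is drawn per missing row; a single np.random.randint() call would fill every gap with the same year.
missing_year = trip.birthyear.isnull()
trip.loc[missing_year, 'birthyear'] = np.random.randint(1969, 1989, size=missing_year.sum())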

In [None]:
trip.info()

In [None]:
trip.birthyear.describe()

In [None]:
year = trip.starttime

def age(year):
    '''This function extracts the year from each element of the starttime column'''
    age = []
    for i in year.index:  # iterate over each label in the index of 'year'
        a = str(year[i])  # 'a' is the timestamp at index 'i', converted to a string like 'YYYY-MM-DD hh:mm:ss'
        b = a.split('-')[0] # split the string on '-' and keep the first piece, which is the year
        c = pd.to_numeric(b)  # converts the year string to a number

        age.append(c.astype(int)) # stores 'c' in the 'age' list, created at the beginning of the function
    return age

In [None]:
# using the function and storing the result in a variable
aged = age(year)

In [None]:
trip['age'] = aged - trip.birthyear
trip['age'] = trip['age'].astype(int)
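Once the column holds real datetimes, the same age calculation can be done without a Python loop through the `.dt` accessor. A minimal sketch on made-up rows (the sample values are hypothetical):

```python
import pandas as pd

# Hypothetical sample with the same column names as the trip dataset
sample = pd.DataFrame({
    'starttime': pd.to_datetime(['2014-10-13 10:31', '2015-06-01 08:00', '2016-02-20 17:45']),
    'birthyear': [1960.0, 1985.0, 1990.0],
})

# .dt.year extracts the year from every timestamp at once
sample['age'] = (sample['starttime'].dt.year - sample['birthyear']).astype(int)
print(sample['age'].tolist())
```

On a few hundred thousand rows this tends to be much faster than building the list element by element.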

In [None]:
trip.columns

In [None]:
trip.head()

In [None]:
# Populating missing values from the gender column
trip.gender.value_counts()

In [None]:
trip.gender.isnull().sum()

In [None]:
# Using forward fill ('ffill') to populate the null values with the last valid observation before them
gender = trip.gender.ffill()
trip.gender = gender
trip.gender.isnull().sum()
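To make the behaviour of forward fill concrete, here is a tiny sketch on a made-up Series (the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical gender values with gaps
gender_sample = pd.Series(['Male', np.nan, np.nan, 'Female', np.nan])

# Each gap takes the last valid value that appears before it
filled = gender_sample.ffill()
print(filled.tolist())
```

Two things to keep in mind: the result depends on the row order, and a gap in the very first row stays null because there is nothing before it to copy.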

In [None]:
trip.head()

In [None]:
station.head()

### Merging data!

The pandas `merge` function gathers data from different tables. Unlike `concat`, it joins the data side by side, matching rows on a key column. Think of it as Excel's VLOOKUP (PROCV in the Portuguese version). To merge with the `on` argument, both dataframes need at least one column with the same name.
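As a side note, when the key columns have different names in the two tables, `merge` also accepts `left_on` and `right_on`. A minimal sketch with made-up station ids and coordinates (all sample values are hypothetical):

```python
import pandas as pd

# Hypothetical trips and stations; the ids and coordinates are illustrative only
trips = pd.DataFrame({'from_station_id': ['CBD-06', 'PS-04', 'CBD-06'],
                      'tripduration': [985.9, 926.4, 883.1]})
stations = pd.DataFrame({'station_id': ['CBD-06', 'PS-04'],
                         'lat': [47.61, 47.60],
                         'long': [-122.33, -122.33]})

# how='left' keeps every trip, even if its station is missing from the station table
merged = pd.merge(trips, stations, left_on='from_station_id', right_on='station_id', how='left')
print(merged)
```

The default is `how='inner'`, which silently drops trips whose station id has no match, so comparing row counts before and after the merge is a cheap sanity check.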

In [None]:
# creating the 'from_station_id' and 'to_station_id' columns in the dataset station

station['from_station_id'] = station.station_id
station['to_station_id'] = station.station_id
station.head()


In [None]:
# creating another dataset, only with the 'from_station_id' column and the location data
from_station = station[['lat', 'long','from_station_id']]
from_station.head()

In [None]:
# Including the latitude and longitude of the start stations in a new dataset: trip2

trip2 = pd.merge(trip,from_station, on='from_station_id')

In [None]:
trip2.info()

In [None]:
# identifying the new columns as the data of the place of departure
trip2.columns = ['starttime', 'tripduration', 'from_station_name',
       'to_station_name', 'from_station_id', 'to_station_id', 'usertype',
       'gender', 'birthyear', 'idade', 'from_lat', 'from_long']
trip2.columns

In [None]:
# creating another dataset, only with the column 'to_station_id'
to_station = station[['lat', 'long','to_station_id']]
to_station.head()

In [None]:
# Including the latitude and longitude of the end stations in a new dataset: trip3

trip3 = pd.merge(trip2,to_station, on='to_station_id')

In [None]:
trip3.columns

In [None]:
# identifying the columns of the arrival data
trip3.columns = ['starttime', 'tripduration', 'from_station_name',
       'to_station_name', 'from_station_id', 'to_station_id', 'usertype',
       'gender', 'birthyear', 'idade', 'from_lat', 'from_long', 'to_lat', 'to_long']
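Assigning the full list to `.columns` works, but it depends on the column order staying exactly the same. One alternative, sketched here on made-up data, is to rename the station columns before each merge. The helper `station_coords` and the sample values are hypothetical, not part of the original notebook.

```python
import pandas as pd

# Hypothetical trips and stations
trips = pd.DataFrame({'from_station_id': ['A', 'B'], 'to_station_id': ['B', 'A']})
stations = pd.DataFrame({'station_id': ['A', 'B'],
                         'lat': [47.61, 47.62],
                         'long': [-122.33, -122.34]})

def station_coords(prefix):
    """Return the station table with columns renamed for one side of the trip."""
    return stations.rename(columns={'station_id': f'{prefix}_station_id',
                                    'lat': f'{prefix}_lat',
                                    'long': f'{prefix}_long'})

# Merge once for the start station and once for the end station
with_coords = (trips
               .merge(station_coords('from'), on='from_station_id')
               .merge(station_coords('to'), on='to_station_id'))
print(with_coords.columns.tolist())
```

The new columns arrive already named, so nothing breaks if a column is added or dropped earlier in the notebook.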

In [None]:
trip3.head()

In [None]:
trip3.info()

### For the dataset to be complete, we need to include the weather data.

Before that, I need to familiarize myself with the variables in the dataset 'weather'. But first let's take a look at where the bike stations are.

In [None]:
# Folium is the library that allows plotting with maps, very simple to use

import folium

In [None]:
station.columns

In [None]:
mapa = folium.Map(location=[ 47.608013,  -122.335167], zoom_start=12) # Centering the map on Seattle using its latitude and longitude
lat = station['lat'].values # taking the latitude values from the stations of the dataset station
long = station['long'].values # taking the values of longitude of the stations of the dataset station

for la, lo in zip(lat, long): # for each value in lat and long...
    folium.Marker([la, lo]).add_to(mapa) # create a marker and place in the map variable (which in this case is the map of Seattle)
mapa # Show the Map

### And voilà!

### There is our map!

In [None]:
trip3.from_station_name.value_counts().head(10)

# Let's see the 10 most popular stations on the map


In [None]:

estacoes_mais_pop = pd.DataFrame(trip3.from_station_name.value_counts().head(10)) # Counting the top 10 and creating a new df
# to pass to folium
station_2 = station[['name','lat', 'long' ]]
station_2.columns = ['from_station_name','lat', 'long']

In [None]:
estacoes_mais_pop = estacoes_mais_pop.reset_index() # resetting the index to adjust the name of the columns

In [None]:
estacoes_mais_pop # note that the column with the station name is named 'index' (pandas 2.0+ already names it 'from_station_name')

In [None]:
estacoes_mais_pop.columns = ['from_station_name','contagem'] # Correcting the problem by simply renaming the columns

In [None]:
estacoes_mais_pop

In [None]:
estacoes_mais_pop = pd.merge(estacoes_mais_pop, station_2, on='from_station_name') # including location data (lat and long) using merge again

In [None]:
estacoes_mais_pop

In [None]:
mapa2 = folium.Map(location=[47.608013,  -122.335167], zoom_start=13) # Same process as above, but we need to create a new Map

lat = estacoes_mais_pop['lat'] 
long = estacoes_mais_pop['long'] 

# This time I wrote the markers line by line because I wanted to include the name of each station on the map.
# I have not found a more practical way to do it yet...

folium.Marker([47.614315, -122.354093],popup='Pier 69 / Alaskan Way & Clay St').add_to(mapa2)
folium.Marker([47.615330 ,-122.311752],popup='E Pine St & 16th Ave').add_to(mapa2)
folium.Marker([47.618418 ,-122.350964],popup='3rd Ave & Broad St ').add_to(mapa2)
folium.Marker([47.610185 ,-122.339641],popup='2nd Ave & Pine St').add_to(mapa2)
folium.Marker([47.613628 ,-122.337341],popup='Westlake Ave & 6th Ave').add_to(mapa2)
folium.Marker([47.622063 ,-122.321251],popup='E Harrison St & Broadway Ave E ').add_to(mapa2)
folium.Marker([47.615486 ,-122.318245],popup='Cal Anderson Park / 11th Ave & Pine St').add_to(mapa2)
folium.Marker([47.619859 ,-122.330304],popup='REI / Yale Ave N & John St ').add_to(mapa2)
folium.Marker([47.615829 ,-122.348564],popup='2nd Ave & Vine St').add_to(mapa2)
folium.Marker([47.620712 ,-122.312805],popup='15th Ave E & E Thomas St').add_to(mapa2)

mapa2

### So far, so good

So let's see the location of the top three stations!

# 1st Pier 69

![Pier 69 - Seattle](http://www.gonorthwest.com/Washington/seattle/Waterfront/images/DSC_2184.jpg)

# 2nd E Pine St / 16th Ave

![E Pine St / 16th Ave](https://t-ec.bstatic.com/images/hotel/max1024x768/539/53967962.jpg)

# 3rd 3rd Ave & Broad St
![3rd Ave & Broad St](https://cdn.downtownseattle.org/app/uploads/2017/10/Metro-on-3rd-high-angle-2-2.jpg)


In [None]:
# Evaluating the weather dataset
weather.head(10)

In [None]:
# Evaluating the dataset trip3, remembering that this dataset contains the location data
trip3.head()

In [None]:
data_str = list(trip3.starttime) # Creating a new date column, with the same date format as the weather dataset;
# this will allow us to add the weather data to the trips.

In [None]:
data_str

In [None]:
data_str = [datetime.strftime(x, '%Y-%m-%d') for x in data_str] # Formatting the column using datetime

In [None]:
data_str[:5]

In [None]:
trip3['Date'] = data_str # Adding the column

In [None]:
type(weather.Date)

In [None]:
trip3.head() # Confirming column

In [None]:
trip3.Date.dtypes

In [None]:
weather.Date.dtypes

In [None]:
# using the same method used on the starttime column of the trip dataset; this is necessary because the Date columns of
# trip3 and weather must have the same dtype to be merged

dt = list(weather['Date']) # copies the values of the Date column (still strings) into a list
dt = pd.to_datetime(dt)  # pandas parses each string in the list into a datetime

weather['Date'] = dt # Saving the changes

In [None]:
weather.head()

In [None]:
trip3.Date = pd.to_datetime(trip3.Date)

In [None]:
trip4 = pd.merge(weather,trip3, on = 'Date')
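The round trip through strings can be avoided: `.dt.normalize()` keeps the datetime dtype and resets the time of day to midnight, which gives a key that matches a daily table. A minimal sketch on made-up rows (the temperatures and the `weather_sample` name are hypothetical):

```python
import pandas as pd

# Hypothetical trips and daily weather
trips = pd.DataFrame({'starttime': pd.to_datetime(
    ['2014-10-13 10:31', '2014-10-13 18:05', '2014-10-14 07:20'])})
weather_sample = pd.DataFrame({'Date': pd.to_datetime(['2014-10-13', '2014-10-14']),
                               'Mean_Temperature_F': [62.0, 59.0]})

# normalize() sets every timestamp to midnight of the same day
trips['Date'] = trips['starttime'].dt.normalize()

# Every trip receives the weather of its day
joined = trips.merge(weather_sample, on='Date', how='left')
print(joined)
```

With `how='left'`, trips on days missing from the weather table are kept with null weather values instead of being dropped.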

In [None]:
trip4.info()

In [None]:
trip4.head()

Now we have a huge dataset with 35 columns covering usage, weather and location. But there are some null values. Let's work on them!

### Mean_Temperature_F

In [None]:
# We have 110 null values in this column
trip4.Mean_Temperature_F.isnull().sum()

In [None]:
trip4.Mean_Temperature_F.describe()

In [None]:
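# Let us fill in the missing data with random values within the mean plus or minus the standard deviation (48 to 67),
# drawing one value per missing row instead of reusing a single random number
missing_temp = trip4.Mean_Temperature_F.isnull()
trip4.loc[missing_temp, 'Mean_Temperature_F'] = np.random.randint(48, 68, size=missing_temp.sum())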

In [None]:
trip4.Mean_Temperature_F.describe()

In [None]:
trip4.Mean_Temperature_F.isnull().sum()

In [None]:
trip4.isnull().sum()

### Max_Gust_Speed_MPH

In [None]:
trip4.Max_Gust_Speed_MPH.describe()

This column should be numeric, but it has too many null values and I don't know what it means. So, we are going to drop it.

In [None]:
trip4.drop(columns='Max_Gust_Speed_MPH', inplace=True)

In [None]:
trip4.isnull().sum()

### Events

I think this information is very important, but it has too many NaN values. Those must be the days when no event occurred.

In [None]:
trip4.Events.describe()

In [None]:
trip4.Events.value_counts()

In [None]:
events = trip4.Events

In [None]:
events.replace('Rain , Thunderstorm', 'Rain-Thunderstorm', inplace = True)
events.replace('Rain , Snow', 'Rain-Snow', inplace = True)
events.replace('Fog , Rain', 'Fog-Rain', inplace = True)
events.value_counts()

In [None]:
events.fillna(value='No-Event', inplace=True)

In [None]:
events.isnull().sum()

In [None]:
events.value_counts()
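The label clean-up can also be written as one chained expression, which avoids listing each combination by hand. A sketch on a made-up Series (the sample values mirror the labels above, and `events_sample` is a hypothetical name):

```python
import numpy as np
import pandas as pd

# Hypothetical sample of event labels, including a missing value
events_sample = pd.Series(['Rain', 'Rain , Thunderstorm', np.nan, 'Fog , Rain', 'Rain , Snow'])

# Replace the ' , ' separator literally (regex=False), then label the gaps
cleaned = events_sample.str.replace(' , ', '-', regex=False).fillna('No-Event')
print(cleaned.tolist())
```

Because the result is assigned back instead of modified in place, it also behaves the same way across pandas versions.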

In [None]:
trip4.info()

Let's choose the columns we are interested in

In [None]:
columns_to_drop = ['Max_Temperature_F','Min_TemperatureF', 'Max_Dew_Point_F', 'MeanDew_Point_F', 'Min_Dewpoint_F',
                   'Max_Humidity', 'Min_Humidity','Max_Sea_Level_Pressure_In', 'Min_Sea_Level_Pressure_In', 'Max_Visibility_Miles',
                   'Min_Visibility_Miles', 'Max_Wind_Speed_MPH']                 

In [None]:
# converting the trip duration from seconds to minutes
trip4.tripduration = trip4.tripduration / 60

In [None]:
trip5 = trip4.drop(columns=columns_to_drop)

In [None]:
trip5.columns

In [None]:
trip5.head()

In [None]:
trip5.describe()

### Next steps  

My goals are:
 * separate the date into new columns: year, month, day and hour
 * from the hour column, create a new feature that identifies the time of day (morning, afternoon or evening), so we can understand when people use the bikes most
 * from the month column, extract the season of the year: summer, spring, winter or autumn
 * create a day-of-the-week column
 
I think these changes can give us a real understanding of the usage patterns.
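As a rough sketch of how these steps could look with the `.dt` accessor: the hour boundaries for each period of the day and the month-to-season mapping (Northern Hemisphere, by calendar month) are my assumptions, and the sample timestamps are made up.

```python
import pandas as pd

# Hypothetical sample timestamps standing in for the starttime column
sample = pd.DataFrame({'starttime': pd.to_datetime(
    ['2014-10-13 08:15', '2015-01-03 14:40', '2015-07-19 20:05'])})
start = sample['starttime'].dt

sample['year'] = start.year
sample['month'] = start.month
sample['day'] = start.day
sample['hour'] = start.hour
sample['weekday'] = start.day_name()

# Assumed boundaries: morning before 12h, afternoon from 12h to 18h, evening from 18h on
sample['period'] = pd.cut(sample['hour'], bins=[0, 12, 18, 24], right=False,
                          labels=['Morning', 'Afternoon', 'Evening'])

# Assumed Northern Hemisphere seasons by calendar month
season_by_month = {12: 'Winter', 1: 'Winter', 2: 'Winter',
                   3: 'Spring', 4: 'Spring', 5: 'Spring',
                   6: 'Summer', 7: 'Summer', 8: 'Summer',
                   9: 'Autumn', 10: 'Autumn', 11: 'Autumn'}
sample['season'] = sample['month'].map(season_by_month)

print(sample[['weekday', 'period', 'season']])
```

With these bins, rides between midnight and 6h also count as morning; a fourth label such as night could be added if that distinction matters.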

***The work continues. Once these transformations are finished, we will begin the exploratory analysis of the data!***



Thank you so much for coming here. Leave your comment and do not forget to follow this work through to the end!

See you later !