**Hi, everyone! This notebook is an introduction to Python programming and data visualization.**

![NewYorkTaxi](https://steamcdn-a.akamaihd.net/steam/apps/446470/header.jpg?t=1457089684)

Suppose you are hired by New York City Transportation to analyze GPS data from the city's yellow cabs.

**New York Taxi Background:**
The taxi and livery system in New York City is the fourth largest transportation provider in the United States. The system is regulated by the New York City Taxi and Limousine Commission (TLC), which oversees yellow taxis, for-hire vehicles, commuter vans, paratransit vehicles, and certain limousines. Despite the scale of the taxi and livery network, the existing system of yellow taxis and for-hire vehicles did not adequately serve all of the boroughs of New York City. Riders in Queens, the Bronx, Brooklyn, Staten Island, and Upper Manhattan had been left out in the cold, too often literally. 

**Problems with New York Taxis:**
Pick-up locations are heavily concentrated in Midtown and around the airports. In other words, taxi service across the five boroughs of New York City was inadequate, which prompted the City to assess the existing taxi regulations and identify opportunities for reform.

To rephrase the question: how can supply and demand for New York City cabs be balanced?

**The government collects the following data:**
The competition dataset is based on the 2016 NYC Yellow Cab trip record data made available in BigQuery on Google Cloud Platform. The data was originally published by the NYC Taxi and Limousine Commission (TLC). The data was sampled and cleaned for the purposes of this playground competition. Based on individual trip attributes, participants should predict the duration of each trip in the test set. **(Data description by Kaggle)**

**Data fields:**
* id - a unique identifier for each trip
* vendor_id - a code indicating the provider associated with the trip record
* pickup_datetime - date and time when the meter was engaged
* dropoff_datetime - date and time when the meter was disengaged
* passenger_count - the number of passengers in the vehicle (driver entered value)
* pickup_longitude - the longitude where the meter was engaged
* pickup_latitude - the latitude where the meter was engaged
* dropoff_longitude - the longitude where the meter was disengaged
* dropoff_latitude - the latitude where the meter was disengaged
* store_and_fwd_flag - This flag indicates whether the trip record was held in vehicle memory before sending to the vendor because the vehicle did not have a connection to the server - Y=store and forward; N=not a store and forward trip
* trip_duration - duration of the trip in seconds
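
As a quick sanity check on these fields, `trip_duration` should roughly equal the gap between `dropoff_datetime` and `pickup_datetime`. The sketch below is only an illustration: the record is invented (it is not a row from the dataset) and the timestamp format `YYYY-MM-DD HH:MM:SS` is an assumption.

```python
from datetime import datetime

# Hypothetical record shaped like the data fields above (all values are invented).
trip = {
    "id": "id0000001",
    "pickup_datetime": "2016-03-14 17:24:55",
    "dropoff_datetime": "2016-03-14 17:32:30",
    "passenger_count": 1,
}

fmt = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format
pickup = datetime.strptime(trip["pickup_datetime"], fmt)
dropoff = datetime.strptime(trip["dropoff_datetime"], fmt)

# Duration in seconds, the same unit as the trip_duration field.
duration_seconds = int((dropoff - pickup).total_seconds())
print(duration_seconds)
```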

**Think about the following questions:** <br/>
(1) Besides the geospatial data, what other data would you need to solve the problem? <br/>
(2) Given the sample data, what do you need to know before you start the analysis? Sketch graphs by hand to show what you have in mind!


**Python Basics:** <br/>

* Open source: anyone can use it and contribute to it without paying a license fee
* Object-oriented design
* Memory based
* Flexible and multipurpose

----------------------------------------------------------------------------------------------------------------------
**What do we aim to achieve today?**

* Understand what packages/libraries are.

* Learn which visualization packages are available in Python.

* Learn how to draw plots in Python (or any other programming language).
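
To make the first goal concrete: a package (or library) is reusable code that you load with `import` and then call by name. The sketch below is a minimal illustration that uses only standard-library modules; the list of trip lengths is invented for the example.

```python
import math                       # import a whole module and use it via its name
import statistics as stats        # import under a short alias, like `import pandas as pd`
from collections import Counter   # import a single name from a module

trip_minutes = [5, 12, 7, 12, 30]  # invented example values

longest = max(trip_minutes)                               # built-in, no import needed
average = stats.mean(trip_minutes)                        # function from the aliased module
rounded_up = math.ceil(average)                           # function from the math module
most_common = Counter(trip_minutes).most_common(1)[0][0]  # most frequent value

print(longest, average, rounded_up, most_common)
```

The three import styles behave the same way for third-party packages such as pandas or seaborn, once those packages are installed.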

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import folium # map-focused package for drawing interactive maps; a wrapper around Leaflet.js
import seaborn as sns # statistical plotting package built on top of matplotlib
import missingno as msno # visualizes the missing values of a dataset
import matplotlib.pyplot as plt # basic plotting interface of matplotlib

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))
# Any results you write to the current directory are saved as output.

In [None]:
from subprocess import check_output
print(check_output(["ls", "../input/nyc-taxi-trip-duration/train/"]).decode("utf8"))



In [None]:
# Load the data
# The Titanic data is used further down as an example of a dataset with missing values
titanic = pd.read_csv("../input/titantic/train.csv")
#train = pd.read_csv("../input/ny-taxi-trip-duration/train.zip/train.csv") 

# Read the taxi training file from the input directory
train = pd.read_csv("../input/nyc-taxi-trip-duration/train/train.csv")
train.head()


In [None]:
# There are two ways to get a basic summary of a DataFrame in Python (comparable to what you may know from R): describe() and info()
summary = train.describe() 
summary 

In [None]:
# As in SAS, you can choose which percentiles and which column types the summary reports
perc = [0.2,0.4,0.6,0.8]
include = ['object','float','int']
desc = train.describe(percentiles = perc,include= include)
desc
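
To see what the `percentiles` argument changes, here is a minimal sketch on an invented five-row frame. The column name is borrowed from the dataset, but the values are made up for illustration.

```python
import pandas as pd

# Invented trip durations in seconds.
toy = pd.DataFrame({"trip_duration": [100, 200, 300, 400, 500]})

# Ask for the 20th/40th/60th/80th percentiles instead of the default quartiles.
toy_desc = toy.describe(percentiles=[0.2, 0.4, 0.6, 0.8])
print(toy_desc)

# Individual statistics can be read back by their row label.
p20 = toy_desc.loc["20%", "trip_duration"]
p80 = toy_desc.loc["80%", "trip_duration"]
```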


In [None]:
# info() lists the non-null count of every column, which reveals missing values
train.info()

**There are two ways to check for missing values: use the pandas info function, or use a visualization technique. Python has data visualization packages for many kinds of datasets. The most common one for static plots is matplotlib, which covers the basic plots for two-dimensional data. Seaborn builds on matplotlib to offer more polished statistical plots, while Plotly and Bokeh produce interactive, JavaScript-based graphics.**
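
As a small illustration of the non-visual route, the sketch below counts missing values per column with `isnull()`. The frame and its column names are invented for the example and only loosely mimic a passenger table.

```python
import numpy as np
import pandas as pd

# Invented data with a few gaps (NaN marks a missing value).
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "fare": [7.25, 71.28, np.nan, 8.05],
    "sex": ["male", "female", "female", "male"],
})

# isnull() yields a True/False frame; summing counts the True values per column.
missing_per_column = toy.isnull().sum()
# Taking the mean instead gives the share of missing values per column.
missing_share = toy.isnull().mean()

print(missing_per_column)
```

The same True/False frame is what a heatmap of `isnull()` colors in, so the counts and the picture tell the same story.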

In [None]:
# One way to visualize missing values; in this particular dataset, none of the columns has missing values.
sns.heatmap(train.isnull(), cbar=False)


In [None]:
# The same plot for a dataset that does have missing values
sns.heatmap(titanic.isnull(),cbar=False)

In [None]:
# Python has another package, missingno, that specializes in visualizing missing values
msno.matrix(titanic)

In [None]:
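# Show the distribution of trip durations
# (sns.distplot is deprecated in recent seaborn releases; histplot with kde=True draws a comparable plot)
sns.histplot(train['trip_duration'], kde=True, color='skyblue', label='trip_duration')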
plt.legend()
#plt.xlim(0,35000)
plt.show()




In [None]:
# Convert the seconds into minutes to get a more readable duration
train['trip_dur_to_m'] = (train['trip_duration'] / 60).round()
train.head(10)

# Generate a frequency table of the durations
train.trip_dur_to_m.value_counts(sort=True) 

In [None]:
# Plot the duration distribution again, this time without the longest 3% of trips
# Initialize a figure with an explicit size
fig, ax = plt.subplots(figsize=(14, 4))
tripduration = train[train.trip_dur_to_m < train.trip_dur_to_m.quantile(.97)]
tripduration.groupby('trip_dur_to_m').count()['id'].plot(ax=ax)

# Label the axes and add a title
plt.xlabel('Trip duration in minutes')
plt.ylabel('Trip count')
plt.title('Duration distribution')
plt.show()
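
The 97% cutoff above relies on `quantile`. Here is a minimal sketch of the same trimming idea on invented values, using a 90% cutoff so the effect is visible on ten numbers.

```python
import pandas as pd

# Invented trip lengths in minutes; 900 stands in for an extreme outlier.
minutes = pd.Series([3, 5, 8, 10, 12, 15, 20, 25, 40, 900])

# Value below which 90% of the observations fall (linear interpolation by default).
cutoff = minutes.quantile(0.9)

# Keep only the trips shorter than the cutoff.
trimmed = minutes[minutes < cutoff]

print(cutoff, len(trimmed), trimmed.max())
```

Trimming like this only changes what is plotted; the original series stays untouched.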

In [None]:
# Show the distribution of passenger counts
sns.histplot(train['passenger_count'], kde=True, color='red', label='passenger_count')
plt.legend()
plt.show()

In [None]:
# An improved way to plot this distribution:
# a horizontal histogram without the density curve
ax = sns.histplot(y=train.passenger_count, color='orange', bins=int(train.passenger_count.max()))
ax.set_ylabel("Passengers distribution")
train.passenger_count.value_counts(sort=False)

In [None]:
# Count the trips per vendor id
vendor_id_count = train['vendor_id'].value_counts() 
# value_counts returns a Series, so sort it by its index (the vendor id)
vendor_id_count = vendor_id_count.sort_index()
vendor_id_count
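
A minimal sketch of how `value_counts` and `sort_index` interact, using an invented list of vendor ids rather than the real column:

```python
import pandas as pd

# Invented vendor ids for six trips.
vendor_ids = pd.Series([2, 1, 2, 2, 1, 2])

# value_counts orders the result by frequency, most common first.
counts = vendor_ids.value_counts()

# sort_index is a method, so it needs parentheses; it returns a new Series
# ordered by the index (here: the vendor id) and leaves `counts` unchanged.
by_vendor = counts.sort_index()

print(by_vendor)
```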



In [None]:
# Bar plot of total rides per vendor id (vendor id on the x-axis, ride count on the y-axis)
sns.barplot(x=vendor_id_count.index, y=vendor_id_count.values, palette='Set2')
plt.xlabel('vendor_id')
plt.ylabel('total rides')
plt.show() 

In [None]:
train.head(10)

In [None]:
tripduration = train[train.trip_dur_to_m < train.trip_dur_to_m.quantile(.97)]
tripduration.head()

In [None]:
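# Extract date/time features from the dataset
# Compute the interval between two timestamp columns, in minutes
def extract_time_interval(df, colname, start, end):
    df_c = df.copy()
    # total_seconds() / 60 works across pandas versions (pandas 2.x rejects astype('timedelta64[m]'))
    df_c[colname] = (df_c[end] - df_c[start]).dt.total_seconds() / 60
    return df_c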

import datetime

# Extract date and time components from the given timestamp columns
def datetime_extract(df, columns, modeling=False):
    df_ = df.copy()
    day_names = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
    for col in columns:
        try:
            # 'pickup_datetime' -> prefix 'pickup'; a name without "_" is used as is
            prefix = col.split("_")[0]
            ts = f"{prefix}_ts"
            df_[ts] = pd.to_datetime(df_[col])
            df_[f"{prefix}_month"] = df_[ts].dt.month
            df_[f"{prefix}_weekday"] = df_[ts].dt.weekday
            df_[f"{prefix}_day"] = df_[ts].dt.day
            df_[f"{prefix}_hour"] = df_[ts].dt.hour
            df_[f"{prefix}_minute"] = df_[ts].dt.minute
            if not modeling:
                df_[f"{prefix}_date"] = df_[ts].dt.date
                df_[f"{prefix}_dayname"] = df_[f"{prefix}_weekday"].apply(lambda x: day_names[x])
            else:
                # drop returns a new DataFrame, so the result must be assigned back
                df_ = df_.drop(columns=[ts, col])
        except (KeyError, ValueError, TypeError) as err:
            # Report columns that are missing or cannot be parsed instead of failing silently
            print(f"Skipping {col}: {err}")
    return df_

   

In [None]:
train = datetime_extract(train, ['pickup_datetime', 'dropoff_datetime'])
train.head(10)
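
`datetime_extract` leans on the pandas `.dt` accessor. The sketch below shows that accessor in isolation on two invented timestamps (the format is assumed to match the dataset's `YYYY-MM-DD HH:MM:SS` style).

```python
import pandas as pd

# Two invented timestamps, parsed from strings into datetime values.
stamps = pd.to_datetime(pd.Series(["2016-03-14 17:24:55", "2016-06-12 00:43:35"]))

# Each .dt attribute returns one component for every row.
parts = pd.DataFrame({
    "month": stamps.dt.month,
    "weekday": stamps.dt.weekday,     # Monday=0 ... Sunday=6
    "hour": stamps.dt.hour,
    "dayname": stamps.dt.day_name(),  # built-in alternative to a manual name lookup
})

print(parts)
```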


In [None]:
train_time = extract_time_interval(train, 'delta_m', 'pickup_ts', 'dropoff_ts')
train_time.head(10)

In [None]:
# Time series plot
# Count the passengers per day
fig, ax = plt.subplots(ncols=2, figsize=(14, 5))
for i, col in enumerate(['pickup', 'dropoff']):
    # Select the column before summing so non-numeric columns are not aggregated
    ax[i].plot(train.groupby(f'{col}_date')['passenger_count'].sum())
    ax[i].set(xlabel='Date', ylabel="Total passengers", title=f"Total passengers per {col} date")
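
The plot above is built from a group-and-sum step. A minimal sketch of that step on an invented three-row frame (the column names are borrowed from the notebook, the values are made up):

```python
import pandas as pd

# Three invented trips on two dates.
toy = pd.DataFrame({
    "pickup_date": ["2016-01-01", "2016-01-01", "2016-01-02"],
    "passenger_count": [1, 3, 2],
})

# Group the rows by date, then add up the passengers within each group.
daily = toy.groupby("pickup_date")["passenger_count"].sum()

print(daily)
```

The result is a Series indexed by date, which is why it can be passed straight to a line plot.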

In [None]:
# Import libraries (only the names the function below actually uses)
from functools import partial
import datashader as ds
from datashader import transfer_functions as tf
from datashader.colors import Hot
from datashader.utils import export_image

# Plot data points by location coordinates
def plot_data_points(longitude, latitude, data_frame, focus_point):
    # Plot dimensions: longitude/latitude window to draw and image size in pixels
    x_range, y_range = ((-74.14, -73.73), (40.6, 40.9))
    plot_width = 750
    plot_height = int(plot_width // 1.2)
    export = partial(export_image, export_path="export", background="black")
    # Aggregate the points onto the canvas, counting focus_point values per pixel
    cvs = ds.Canvas(plot_width=plot_width, plot_height=plot_height,
                    x_range=x_range, y_range=y_range)
    agg = cvs.points(data_frame, longitude, latitude,
                     ds.count(focus_point))
    # Color the counts with the Hot colormap, then spread sparse pixels so they stay visible
    img = tf.shade(agg, cmap=Hot, how='eq_hist')
    image_xpt = tf.dynspread(img, threshold=0.5, max_px=4)
    return export(image_xpt, "NYCT_hot")

plot_data_points('pickup_longitude', 'pickup_latitude', train, "passenger_count")
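
Conceptually, the canvas step bins every point into a grid cell and counts the points per cell. As a rough analogy only (this is plain NumPy, not how datashader is implemented), the same idea on four invented coordinates and a coarse 4 x 3 grid looks like this:

```python
import numpy as np

# Four invented pickup locations (longitude, latitude).
lon = np.array([-74.00, -74.00, -73.99, -73.80])
lat = np.array([40.72, 40.72, 40.75, 40.85])

# Count points per grid cell inside the same window the notebook plots.
counts, lon_edges, lat_edges = np.histogram2d(
    lon, lat,
    bins=[4, 3],                                # 4 longitude bins, 3 latitude bins
    range=[(-74.14, -73.73), (40.6, 40.9)],     # window taken from the plotting function
)

print(counts)
```

A real density map uses one cell per pixel and then maps the counts to colors, which is what the shading step does.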

In [None]:
plot_data_points('dropoff_longitude', 'dropoff_latitude',train,"passenger_count")

# Note: the code from here to the legend call is R (base graphics), not Python
# generate pairs of x-y values
theta = seq(-2 * pi, 2 * pi, length = 300)
x = cos(theta)
y = x + sin(theta) 
 
# set graphical parameters
op = par(bg = "white", mar = rep(0.1, 4))
 
# plot
plot(x, y, type = "n", xlim = c(-8, 8), ylim = c(-1.5, 1.5))
for (i in seq(-2*pi, 2*pi, length = 100))
{
  lines(i*x, y, col = hsv(runif(1, 0.85, 0.95), 1, 1, runif(1, 0.2, 0.5)), 
        lwd = sample(seq(.5, 3, length = 10), 1))          
}
 
# signature
legend("bottom", legend = "Sophie Champagne", bty = "n", text.col = "gray70")

**References:**

* [FlowingData](https://flowingdata.com/) 

* [MOMA data visualization](https://www.moma.org/explore/inside_out/2015/12/10/data-visualization-design-and-the-art-of-depicting-reality/) 

* [Art made of data](https://www.ted.com/playlists/201/art_from_data) 

* [Modern Art Gallery](https://www.r-graph-gallery.com/portfolio/data-art/) 

