# Mass Transit: Project Part 1

It's your first day on the job. Your first task is to analyze the Mass Transit Data and present general insights. You will build histograms, boxplots and scatter plots, along with analyzing the mean, median, max, min and quartiles to present compelling insights and to look for interesting patterns.

Whenever you come across a cell with code in it, go ahead and run it. When you are prompted for your own solution, write your code in the blank Python cell and run that as well. If you get errors, don't worry, just go back and look at your code. Some errors are easily corrected. Other errors may need support from Stack Overflow or your classmates. Just be patient, and keep in mind that we all make mistakes. The solutions will be made available to you.

### 1. Imports

Since you are working in Python in a Jupyter Notebook, you want to use the following standard imports.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### 2. Access Data

Next, you need to access the data as follows.

In [None]:
df = pd.read_csv('uber_every_1K_rows.csv', parse_dates=['Trip Start Timestamp', 'Trip End Timestamp'])

1. df is a conventional variable name. It stands for DataFrame. Throughout this project, whenever you see df, it's a reference to the data.

2. Notice the argument parse_dates=['Trip Start Timestamp', 'Trip End Timestamp']. This is very important. It tells pandas to treat these columns as dates. This will be useful later on.
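
To see what parse_dates changes, here is a minimal sketch that reads a tiny CSV twice, once without and once with the argument. The two rows and the in-memory file are made up for illustration and are not taken from the project data.

```python
import io
import pandas as pd

# Two made-up rows that imitate a timestamp column (not real project data).
csv_text = (
    "Trip Start Timestamp,Fare\n"
    "2019-03-01 08:15:00,7.5\n"
    "2019-03-02 18:45:00,12.0\n"
)

# Without parse_dates, the timestamps are read as plain text.
raw = pd.read_csv(io.StringIO(csv_text))

# With parse_dates, pandas converts the column to a datetime type.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=['Trip Start Timestamp'])

print(raw['Trip Start Timestamp'].dtype)
print(parsed['Trip Start Timestamp'].dtype)

# Date parts such as the hour are only available on the parsed column.
print(parsed['Trip Start Timestamp'].dt.hour.tolist())
```

Only the parsed column supports the .dt accessor, which is what makes the hour and day-of-week columns later in the project possible.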


### 3. View Data

It's always a good idea to view the data. .head() is a method that shows you the first five rows. Use .head() following df to view the first 5 rows in the cell below.

In [None]:
df.head()

Take a moment to examine the data. Lots of interesting stuff. Notice the NaN values? These need to be dealt with. First, let's analyze the data using .describe().

### 4. Describe Data

.describe() is a great method for viewing the core statistics at once. Use .describe() with df to view the core statistics in the cell below.

In [None]:
df.describe()

##### Write a paragraph stating insights from the data above.

SOLUTION - .describe() shows the count, the mean, the standard deviation, the min and max, and the quartiles, including the median. The mean is usually greater than the median in this dataset. This indicates that most of the data is very right skewed. This can happen when most of the data is clustered around 0, and the mean is pulled away from the median by very high outliers. As one example, the median trip is 3.6 miles, but the max trip is almost 246 miles.
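
The link between right skew and a mean that sits above the median can be checked on a small sample. The distances below are made up for illustration; only the pattern matters.

```python
import pandas as pd

# Made-up trip distances: mostly short trips plus one very long outlier.
miles = pd.Series([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 120.0])

print(miles.mean())    # pulled upward by the single long trip
print(miles.median())  # barely affected by the long trip
```

Removing the 120-mile value would bring the mean back close to the median, which is why a large gap between the two hints at outliers on the high end.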

### 5. Column Info

Use .info() as a method on df to view the info on the various columns.

In [None]:
df.info()

Whereas .describe() shows statistics for the numerical columns, .info() shows all the columns and their types. It also shows the non-null count for each column, which reveals where null values occur.
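
Counting null columns by eye from .info() works, but the count can also be computed. This is a minimal sketch on a made-up frame; the same two lines would apply to df.

```python
import numpy as np
import pandas as pd

# Made-up frame with missing values in two of its three columns.
demo = pd.DataFrame({
    'Fare': [7.5, np.nan, 12.0],
    'Tip': [0.0, 2.0, np.nan],
    'Trip Miles': [1.2, 3.4, 5.6],
})

# Number of missing values in each column.
nulls_per_column = demo.isnull().sum()

# Number of columns that contain at least one missing value.
columns_with_nulls = (nulls_per_column > 0).sum()

print(nulls_per_column)
print(columns_with_nulls)
```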

##### How many columns above contain null values?

In [None]:
10

There are many options for treating null values. The easiest is to get rid of them completely. This may not be possible if there are too many null values. Since our dataset is fairly large to begin with, we will take the easiest approach.
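
The row-dropping approach described above can be tried on a small made-up frame first, to see exactly which rows disappear.

```python
import numpy as np
import pandas as pd

# Made-up frame: rows 1 and 2 each have one missing value.
demo = pd.DataFrame({
    'Fare': [7.5, np.nan, 12.0, 5.0],
    'Tip': [0.0, 2.0, np.nan, 1.0],
})

# how='any' drops a row if at least one of its values is missing;
# axis=0 means rows (not columns) are dropped.
cleaned = demo.dropna(how='any', axis=0)

print(len(demo), len(cleaned))
print(cleaned)
```

Comparing the length before and after is a quick way to judge how much data the easy approach costs.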

In [None]:
df = df.dropna(how='any',axis=0) 

To view the length of the new dataFrame, df, we can use the len() function as follows.

In [None]:
len(df)

## Visual Graphs

Now let's explore the graphs visually. It's always important to include visual graphs as part of your analysis. Graphs are easier to comprehend on sight, and they capture important information.

After every graph that is displayed, be sure to include a brief summary of what you see. The summary may include words like skew, normal, and modal (bi-modal, uni-modal). 

### 6. Histograms

We will start with histograms. Remember to use plt.hist(), and put the relevant column inside of parentheses. Then use plt.show() to display the graph.

##### Display a histogram of df['Fare'] in the cell below

In [None]:
plt.hist(df['Fare'])
plt.show()

INTERPRETATION - The fares are very right skewed. This is to be expected since there cannot be negative fares, and most riders take short trips. Very few rides are in excess of 60 dollars.

We can improve the visuals of our histogram using the following steps:

1. Use Seaborn - A popular Python library with nice defaults.
2. Make figure larger - Great for presentations later on.
3. Change number of bins - Can change interpretation of the graph.
4. Create labels for y and x - All graphs should have the axes labeled.
5. Give the histogram a title - All graphs should be titled.
6. Change the limit of x - Since many bars are invisible, limiting x can show more information about the distribution of larger values.

Let's do this now by running the next two cells.

In [None]:
# Run this cell to improve your visuals with Seaborn
import seaborn as sns

# This sets up the seaborn dark grid
sns.set()

In [None]:
plt.figure(figsize=(10,7))
plt.xlim(0,60)
plt.xlabel('Fare')
plt.ylabel('Frequency')
plt.title('Uber Fares')
plt.hist(df['Fare'], bins=25)
plt.savefig('Uber_Fares_Hist', dpi=300)
plt.show()

INTERPRETATION - The overall distribution of fares is right skewed and unimodal with a peak at around \$10. There are very few fares at over \$40.

The code above this histogram may be copied and modified to generate new histograms of other columns. Also notice that we added the following line:

plt.savefig('Uber_Fares_Hist', dpi=300)

This will save the figure to your computer under the name 'Uber_Fares_Hist'. Because no file extension is given, matplotlib saves it in its default format, which is normally PNG. The parameter dpi=300 means that the figure is saved at 300 dots per inch, so it will look rather crisp. We recommend saving the figures that you create so that you can use them in the project presentation.
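
Here is a minimal, self-contained sketch of saving a histogram to a file. The values are randomly generated stand-ins for fares, and the temporary folder and file name are assumptions chosen for illustration.

```python
import os
import tempfile

import numpy as np
import matplotlib.pyplot as plt

# Made-up, right-skewed values standing in for fares.
rng = np.random.default_rng(0)
fake_fares = rng.exponential(scale=10, size=500)

plt.figure(figsize=(10, 7))
plt.hist(fake_fares, bins=25)
plt.xlabel('Fare')
plt.ylabel('Frequency')
plt.title('Made-up Fares')

# Save into a temporary folder; the explicit .png extension makes the file type clear.
out_path = os.path.join(tempfile.mkdtemp(), 'fake_fares_hist.png')
plt.savefig(out_path, dpi=300)
plt.close()

print(os.path.exists(out_path))
```

With figsize=(10,7) and dpi=300, the saved image is 3000 by 2100 pixels, which is large enough for slides.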

##### Display a histogram of df['Trip Seconds'] using only plt.hist() and plt.show()

In [None]:
plt.hist(df['Trip Seconds'])
plt.show()

##### Improve the histogram of df['Trip Seconds'] by making it larger and adding labels in the cell below. 
You also may want to change the number of bins and xlim. Be sure to interpret the graph in the provided cell.

In [None]:
plt.figure(figsize=(10,7))
plt.xlim(0,5000)
plt.xlabel('Seconds')
plt.ylabel('Frequency')
plt.title('Duration of Trip in Seconds')
plt.hist(df['Trip Seconds'], bins=40)
plt.show()

Trip Seconds peak at around 500-700. The distribution is unimodal and very right skewed. This indicates that most trips are about 10 minutes long.

##### Display a histogram of df['Tip'] using only plt.hist() and plt.show()

In [None]:
plt.hist(df['Tip'])
plt.show()

##### Improve the histogram of df['Tip'] by making it larger and adding labels in the cell below.
You may want to change the number of bins and xlim. Be sure to interpret the graph in the provided cell.

In [None]:
plt.figure(figsize=(10,7))
plt.xlim(0,10)
plt.xlabel('Tips')
plt.ylabel('Frequency')
plt.title('Rider Tips')
plt.hist(df['Tip'], bins=30)
plt.show()

Most riders do not tip. The distribution of tips is right skewed, with a peak at 0 followed by a swift drop-off. Very few riders tip more than \$6.

##### Choose another column to explore on your own. 
The output should be a histogram saved to a file that is appropriately labeled.

In [None]:
plt.figure(figsize=(10,7))
plt.xlabel('Additional Charges')
plt.ylabel('Frequency')
plt.title('Additional Charges Per Trip')
plt.hist(df['Additional Charges'])
plt.show()

There is usually an additional fee of around \$2 per trip. This distribution is not continuous. Additional fees may be under \$2, or a little beyond \$8, with scattered values in between.

### 7. Box Plots

Now we will examine the same columns using Box Plots. Let's do the first one together.

##### Display a boxplot of df['Fare'].

In [None]:
plt.figure(figsize=(10,7))
plt.boxplot(df['Fare'])
plt.title('Fares')
plt.show()

In a box plot, each circle represents an outlier. The presence of outliers distorts the box itself. Generally speaking, the box should be the main piece of the diagram, not the outliers. Viewing the outliers, however, gives a better sense of the distribution of the data. Let's zoom in on the box by changing the y-limit below.

##### Display a boxplot of df['Fare'] by zooming in.

In [None]:
plt.figure(figsize=(10,7))
plt.boxplot(df['Fare'])
plt.title('Fares Zoomed In')
plt.ylim(-1,21)
plt.show()

INTERPRETATION - The median fare is around \$7.50. The first and third quartiles are about \$5 and \$10. Rides beyond \$17.50 are considered outliers, and the minimum is \$0.

##### Display a boxplot of df['Trip Seconds'].

In [None]:
plt.figure(figsize=(10,7))
plt.boxplot(df['Trip Seconds'])
plt.title('Seconds Per Trip')
plt.show()

##### Display a boxplot of df['Trip Seconds'] by zooming in.
Be sure to include your interpretation.

In [None]:
plt.figure(figsize=(10,7))
plt.boxplot(df['Trip Seconds'])
plt.title('Seconds Per Trip Zoomed In')
plt.ylim(-10,2400)
plt.show()

INTERPRETATION - The median trip is about 750 seconds. The first and third quartiles are about 500 and 1200 seconds. Trips beyond 2300 seconds are considered outliers, and no trip can be less than 0 seconds.

##### Display a boxplot of df['Tip'].

In [None]:
plt.figure(figsize=(10,7))
plt.boxplot(df['Tip'])
plt.title('Tips')
plt.show()

##### Display a boxplot of df['Tip'] by zooming in.
Be sure to include your interpretation in the cell below.

In [None]:
plt.figure(figsize=(10,7))
plt.boxplot(df['Tip'])
plt.title('Tips Zoomed In')
plt.ylim(-1, 2.5)
plt.show()

All tips are considered outliers! Ouch.

### 8. Column Analysis - Tips

Occasionally, you will find peculiar information that needs further analysis. Let's consider the box plot of tips given above. Even with zooming in, the box plot indicates that all tips are outliers.

Some questions immediately arise. 

1. How are these outliers determined?
2. How likely is it that a driver receives a tip?
3. What is the mean tip?
4. What is the median tip?
5. Is there a way to earn more tips?

1. How are these outliers determined?

There is no universal way to define outliers. We know from the histogram that some tips occur with decent frequency, so we can conclude that the Python boxplot is determining outliers liberally. Let's check to see if it uses 1.5 * IQR by examining the Fare boxplot above.

##### What is the IQR of the Fare Box Plot above?
Set the value equal to the variable IQR in the cell below.

In [None]:
IQR = 5

##### Now multiply IQR by 1.5

In [None]:
IQR * 1.5

##### Go back to the boxplot and find the third quartile. Add 1.5 * IQR to the third quartile. What do you get?

In [None]:
17.5

So it appears that the boxplots do use 1.5 * IQR to calculate the outliers. So what's the issue? 

##### Use .describe() on df['Tip'] in the cell below. 
Then answer the question, 'Why are all tips outliers?' in the cell below.

In [None]:
df['Tip'].describe()

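Since the first and third quartiles both have a value of 0, this makes IQR = 0, and 1.5 * IQR = 0. The outlier cutoff is therefore the third quartile itself, which is 0, so every nonzero tip is flagged as an outlier.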
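
The same arithmetic can be reproduced on a handful of made-up tips. The helper upper_fence below is a hypothetical name introduced for this sketch; it applies the Q3 + 1.5 * IQR rule that matches matplotlib's default whisker setting (whis=1.5).

```python
import numpy as np

def upper_fence(values):
    """Return Q3 + 1.5 * IQR, the usual upper cutoff for box plot outliers."""
    q1, q3 = np.percentile(values, [25, 75])
    return q3 + 1.5 * (q3 - q1)

# Made-up tips: most riders tip nothing, a few tip a little.
tips = np.array([0, 0, 0, 0, 0, 0, 0, 0, 2, 3])

fence = upper_fence(tips)
outliers = tips[tips > fence]

print(fence)
print(outliers)
```

Because more than 75% of the values are 0, both quartiles are 0 and the fence collapses to 0, so every nonzero value lands above it.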

The next question is whether all tips should be outliers. It seems like it's important to get more information on tips. Let's start by calculating what percentage of riders actually tip. To do this, we need to figure out the number of tippers, and the number of rides. Execute the following two cells to reveal this information.

In [None]:
# Compute number of rides where there could have been a tip.
len(df['Tip'])

In [None]:
# Compute number of rides where there was a tip.
len(df[df['Tip'] != 0])

##### What percentage of riders tip?

In [None]:
len(df[df['Tip'] != 0])/len(df['Tip'])

##### Provide your own analysis. Should all tips be considered outliers?

Although the majority of riders do not tip, 18% of riders do. The histogram reveals that most tips are around \$2, which is close to the median of 0. Tipping in itself is not an outlier, and low tips are too common to be treated as outliers. Very high tips may be considered outliers. The question is how high?

Outliers are open to interpretation. Let's follow the train of thought where not all tips are considered outliers. Outliers should be rare, and an IQR of 0 is one reason why 1.5 * IQR is not a universal method to determine outliers.

Let's analyze the distribution of tippers. Let's work with a new dataFrame that only includes the 18% who tip. We can achieve this by running the following cell.

In [None]:
# Create DataFrame of Tippers
df_tippers = df[df['Tip'] != 0]

Now we can create boxplots and histograms using the new dataFrame. Instead of df, you should use df_tippers.
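
The filtering step used to build df_tippers is a boolean mask. Here is a minimal sketch on a made-up frame; demo, is_tipper and share_tipping are names chosen for illustration.

```python
import pandas as pd

# Made-up tips for five rides.
demo = pd.DataFrame({'Tip': [0.0, 2.0, 0.0, 0.0, 5.0]})

# Boolean Series that is True where a tip was left.
is_tipper = demo['Tip'] != 0

# Indexing with the mask keeps only the rows marked True.
demo_tippers = demo[is_tipper]

# The mean of a boolean Series is the fraction of True values.
share_tipping = is_tipper.mean()

print(len(demo_tippers))
print(share_tipping)
```

Taking the mean of the mask is a compact alternative to dividing two len() calls when computing the share of riders who tip.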

##### Display a histogram of df_tippers['Tip'] using only plt.hist() and plt.show()

In [None]:
plt.hist(df_tippers['Tip'])
plt.show()

##### Improve the histogram of df_tippers['Tip'] by making it larger and adding labels in the cell below.
You may want to change the number of bins and xlim. Be sure to interpret the graph in the provided cell.

In [None]:
plt.figure(figsize=(10,7))
plt.xlim(0,15)
plt.xlabel('Tips')
plt.ylabel('Frequency')
plt.title('Rider Tips')
plt.hist(df_tippers['Tip'], bins=30)
plt.show()

This histogram gives a much clearer distribution of tips. The tips distribution is right-skewed and unimodal with a peak of around \$2. Most riders tip less than \$6, and very few riders tip more than \$11.

##### Display a boxplot of df_tippers['Tip'].

In [None]:
plt.figure(figsize=(10,7))
plt.boxplot(df_tippers['Tip'])
plt.title('Tips')
plt.show()

##### Display a boxplot of df_tippers['Tip'] zoomed in.
Be sure to include your interpretation in the cell below.

In [None]:
plt.figure(figsize=(10,7))
plt.ylim(0.5, 7.5)
plt.boxplot(df_tippers['Tip'])
plt.title('Tips')
plt.show()

When restricting the range to tippers, the median tip is \$2, and the interquartile range runs from \$1 to \$3. Tips greater than \$6 are considered outliers.

The boxplot indicates that tips over \$6 should be considered outliers, while the histogram suggests that tips beyond \$11 are rare. As a final piece of analysis, we can examine some quantiles by running the next four cells.
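
To get a feel for how quantiles differ between all riders and tippers only, here is a minimal sketch on 100 made-up tips, 80 of which are zero. The numbers are invented and are not the project's results.

```python
import pandas as pd

# Made-up tips: 100 rides, 80 of them with no tip.
tips = pd.Series([0.0] * 80 + [1.0] * 10 + [2.0] * 5 + [3.0] * 4 + [10.0])

# Keep only the rides where a tip was left.
tippers = tips[tips != 0]

# quantile(q) returns the value below which a fraction q of the data falls.
print(tips.quantile(0.5))      # median over all rides
print(tippers.quantile(0.5))   # median over tippers only
print(tips.quantile(0.9))      # 90th percentile over all rides
print(tippers.quantile(0.9))   # 90th percentile over tippers only
```

The same quantile is always lower for all rides than for tippers here, because the zeros take up most of the lower end of the distribution.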

In [None]:
# The 90th percentile of tippers
df_tippers['Tip'].quantile(0.9)

In [None]:
# The 90th percentile of riders
df['Tip'].quantile(0.9)

In [None]:
# The 99th percentile of tippers
df_tippers['Tip'].quantile(0.99)

In [None]:
# The 99th percentile of riders
df['Tip'].quantile(0.99)

##### Give your final justification of what tips should be considered outliers.

99% of all riders either do not tip or tip less than \$6. It's safe to say that tips over \$6 are rare enough to be outliers, as evidenced in the boxplot of tippers.

##### What was the max tip? 
Use .max() on df['Tip']

In [None]:
df['Tip'].max()

### 9. Scatter Plots

Scatter plots take two numeric columns, and plot them on the x and y axes. Scatter plots are particularly useful for examining the relationship between columns, a topic that we will investigate in further detail later in the course. For now, let's examine a variety of scatter plots to develop an intuition for how the data is related.

First, let's recall the columns at our disposal along with their types by running the cell below.

In [None]:
df.info()

Every column listed with float64 or int64 at the end is a numeric type. Notice that this includes latitude and longitude, along with community areas in addition to some of the columns we have already examined.

We can think of the y axis as the column that we are trying to predict, and the x axis as our independent variable. The general syntax is as follows: 

plt.scatter(x,y)

As a starting point, let's run the next few cells to determine if the pickup centroid latitude can influence the tip.

##### Create a scatter plot of df['Pickup Centroid Latitude'], and df['Tip']

In [None]:
plt.scatter(df['Pickup Centroid Latitude'], df['Tip'])
plt.show()

A couple of comments. First, the density of points makes it difficult to judge how many values overlap. We can use the alpha parameter to make the points partially transparent. Setting alpha=0.4 draws each point at 40% opacity, so regions where many points overlap appear darker.
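
The effect of alpha is easiest to see on points that overlap heavily. This is a minimal sketch using randomly generated integer coordinates, so many points share the same position; the data and labels are made up.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up points on a small integer grid, so many of them overlap exactly.
rng = np.random.default_rng(1)
x = rng.integers(0, 5, size=300)
y = rng.integers(0, 5, size=300)

plt.figure(figsize=(10, 7))

# alpha=0.4 draws each point at 40% opacity, so stacked points look darker.
points = plt.scatter(x, y, alpha=0.4)

plt.xlabel('x')
plt.ylabel('y')
plt.title('Overlapping Points With alpha=0.4')
```

In a notebook cell, adding plt.show() as the last line displays the figure. Grid positions that hold more points appear darker than positions that hold only a few.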

We can also make the scatter plot larger and add labels.

##### Create a scatter plot of df['Pickup Centroid Latitude'], and df['Tip'] with a larger plot, alpha = 0.4 and labels.

In [None]:
plt.figure(figsize=(10,7))
plt.scatter(df['Pickup Centroid Latitude'], df['Tip'], alpha = 0.4)
plt.xlabel('Latitude')
plt.ylabel('Tip')
plt.title('Estimating Tip From Latitude of Pickup')
plt.show()

INTERPRETATION - It appears that certain pickup latitudes are more likely to result in larger tips. This is an intuition based on the plot that will be investigated at a later time.

##### Create a scatter plot of df['Pickup Centroid Longitude'], and df['Tip'] with a large plot, alpha = 0.4 and labels.
Be sure to include your interpretation in the provided cell.

In [None]:
plt.figure(figsize=(10,7))
plt.scatter(df['Pickup Centroid Longitude'], df['Tip'], alpha = 0.4)
plt.xlabel('Longitude')
plt.ylabel('Tip')
plt.title('Estimating Tip From Longitude of Pickup')
plt.show()

It also appears that certain pickup longitudes result in higher tips. 

##### Create a scatter plot of df['Trip Miles'], and df['Tip'] with a large plot, alpha = 0.4 and labels.
Be sure to include your interpretation in the provided cell.

In [None]:
plt.figure(figsize=(10,7))
plt.scatter(df['Trip Miles'], df['Tip'], alpha = 0.4)
plt.xlabel('Miles')
plt.ylabel('Tip')
plt.title('Estimating Tip From Trip Miles')
plt.show()

It does not appear that longer trips result in bigger tips.

##### Create a scatter plot of df['Trip Miles'], and df['Fare'] with a large plot, alpha = 0.4 and labels.
Be sure to include your interpretation in the provided cell.

In [None]:
plt.figure(figsize=(10,7))
plt.scatter(df['Trip Miles'], df['Fare'], alpha = 0.4)
plt.ylabel('Fare')
plt.xlabel('Trip Miles')
plt.title('Estimating Fare From Trip Miles')
plt.show()

There is a positive linear association between Trip Miles and Fare. This is to be expected since the distance of the trip determines how much one is supposed to pay.

##### Create a scatter plot of df['Fare'], and df['Trip Total'] with a large plot, alpha = 0.4 and labels.
Be sure to include your interpretation in the provided cell.

In [None]:
plt.figure(figsize=(10,7))
plt.scatter(df['Fare'], df['Trip Total'], alpha = 0.4)
plt.xlabel('Fare')
plt.ylabel('Trip Total')
plt.title('Estimating Trip Total From Fare')
plt.show()

It's not surprising that Trip Total can be predicted rather well from the Fare. It must equal or exceed the fare depending on the tip and other fees.

#### Bonus Tip: You may add a straight line as a baseline of comparison
This is useful because the line y = x marks the lowest possible value, which occurs when there is no tip or other charge and the Trip Total equals the Fare.

In [None]:
# Plot of Fare and Trip Total
plt.figure(figsize=(10,7))
plt.scatter(df['Fare'], df['Trip Total'], alpha = 0.4, label='Rides')
plt.xlabel('Fare')
plt.ylabel('Trip Total')
plt.title('Estimating Trip Total From Fare')

# Plot of straight line y = x
x = np.linspace(0,125,75)
y = x
# label= gives each plot a name so that plt.legend() has entries to show
plt.scatter(x, y, alpha=0.6, marker=".", linewidths=0.5, label='Trip Total = Fare')

# Legend to distinguish between plots
plt.legend(loc='upper left')

# Show both plots
plt.show()

##### Create a scatter plot of df['Pickup Centroid Latitude'], and df['Pickup Centroid Longitude'].
Be sure to include your interpretation in the provided cell.

In [None]:
plt.figure(figsize=(10,7))
plt.scatter(df['Pickup Centroid Latitude'], df['Pickup Centroid Longitude'], alpha=0.4)
plt.show()

This appears to be a map of Chicago, if rotated 90 degrees!

### 10. Time Series Analysis

There are a couple of time columns in the dataFrame. We can use .info() on df to recall which ones.

##### Use .info() on df to show all columns and types.

In [None]:
df.info()

Trip Start Timestamp and Trip End Timestamp are both time columns as evidenced by the datetime64[ns] listed at the end of the row. Let's view one of the time columns by running the following cell.

In [None]:
df['Trip End Timestamp'].head()

As you can see, the column includes both dates and times.

It can be interesting to create new columns that include the hour and the day of the week as baselines of comparison. This can be done whenever there is a datetime column in Python as follows.

In [None]:
# The .dt accessor used below is part of pandas, so no extra import is needed

df['month'] = df['Trip Start Timestamp'].dt.month
df['hour'] = df['Trip Start Timestamp'].dt.hour
df['dayofweek'] = df['Trip Start Timestamp'].dt.dayofweek

##### View the first five rows of the new dataFrame using .head().
Scroll to the right to view the new columns.

In [None]:
df.head()

Note that the dayofweek column starts with 0, which is a Monday.
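
The numbering can be verified on a few known dates. This is a minimal sketch with made-up timestamps; 2024-01-01 fell on a Monday, so it should map to 0.

```python
import pandas as pd

# Made-up timestamps on known weekdays.
stamps = pd.Series(pd.to_datetime([
    '2024-01-01 08:30:00',   # a Monday
    '2024-01-06 18:05:00',   # a Saturday
    '2024-01-07 23:59:00',   # a Sunday
]))

print(stamps.dt.hour.tolist())
print(stamps.dt.dayofweek.tolist())

# day_name() gives readable labels, which can be easier to interpret than 0-6.
print(stamps.dt.day_name().tolist())
```

If the numeric codes are hard to read in a table, .dt.day_name() is one way to add a labeled column alongside dayofweek.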

##### Which day of the week has the most rides? The least rides?
We can answer this question using a groupby as follows.

In [None]:
df.groupby(['dayofweek']).count()

Most rides are on a Saturday, and the fewest rides are on a Monday.
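
Since .count() repeats the count for every column, a more compact view is sometimes handy. This sketch uses .size() on a made-up frame as one possible alternative; it is not the only way to answer the question.

```python
import pandas as pd

# Made-up rides: one on Monday (0), three on Saturday (5), two on Sunday (6).
demo = pd.DataFrame({
    'dayofweek': [0, 5, 5, 5, 6, 6],
    'Fare': [7.5, 9.0, 12.0, 6.0, 8.0, 10.0],
})

# size() returns one row count per group instead of one count per column.
rides_per_day = demo.groupby('dayofweek').size()

print(rides_per_day)

# idxmax() and idxmin() return the group labels with the most and fewest rides.
print(rides_per_day.idxmax(), rides_per_day.idxmin())
```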

##### What are the most popular times to get a ride?
Use a groupby in the next cell. Answer the question directly in the cell that follows. 

In [None]:
df.groupby(['hour']).count()

The most popular time is around 6pm (18). The least popular is 4am.

##### Create a histogram of df['hour'], then answer the following questions.
1. Describe the data in terms of hours.
2. What are the best time intervals to look for riders?

In [None]:
plt.figure(figsize=(10,7))
plt.hist(df['hour'], bins=24)
plt.show()

Rides start to pick up at around 7 in the morning with a mini-peak at 8am. Rides then level off from 10-2. They really pick up around 4pm, peaking at 6, before tapering off at 8. There are plenty of night rides before a sudden decline around midnight. The best time to look for riders is 5-7pm, or more broadly 4-11pm.

The histogram above is periodic. This means that it does not have a true start or end: hour 23 is followed by hour 0, and the pattern repeats every day. Let's take a look at another periodic histogram, day of week.

##### Create a histogram of df['dayofweek'], then answer the following questions.
1. What is the pattern of rides throughout the week? 
2. What is the best interval of days to look for riders?

In [None]:
plt.figure(figsize=(10,7))
plt.hist(df['dayofweek'], bins=7)
plt.show()

Mon. is the slowest day, with ridership slightly increasing on Tues. and Wed. It picks up on Thurs. before peaking on Fri. and Sat. After the steady increase in rides, there is a sharp drop-off on Sunday. Thursday through Saturday are the best days to find riders.

##### Do riders tip more or less on average on certain days?
Use a groupby with .mean() and find the 'Tip' column. 

In [None]:
# numeric_only=True skips text columns, which cannot be averaged
df.groupby(['dayofweek']).mean(numeric_only=True)

Riders tip the most on average on Monday, and the least on Saturday.
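
To read off the answer without scanning a wide table, the 'Tip' column can be selected before taking the mean. This is a minimal sketch on a made-up frame; the 'Company' column is an invented stand-in for any text column.

```python
import pandas as pd

# Made-up rides with a numeric tip and a non-numeric column.
demo = pd.DataFrame({
    'dayofweek': [0, 0, 5, 5, 6],
    'Tip': [2.0, 4.0, 0.0, 1.0, 2.0],
    'Company': ['A', 'B', 'A', 'B', 'A'],
})

# Selecting 'Tip' before .mean() averages only that column for each day.
mean_tip_by_day = demo.groupby('dayofweek')['Tip'].mean()

print(mean_tip_by_day)

# idxmax() returns the day with the highest average tip.
print(mean_tip_by_day.idxmax())
```

The same pattern works for the hourly comparison by grouping on 'hour' instead of 'dayofweek'.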

##### Do riders tip more or less on average during certain hours?

In [None]:
# numeric_only=True skips text columns, which cannot be averaged
df.groupby(['hour']).mean(numeric_only=True)

It looks like 4am riders tip the most, and 2am riders tip the least.