<div style="border:solid blue 2px; padding: 20px">

 **Overall Summary of the Project**

Dear Sang Le,

Congratulations on completing your data analysis project for Zuber! Your notebook is well-organized and provides a comprehensive analysis of taxi ride patterns in Chicago. You've effectively structured your project with clear sections, detailed explanations, and appropriate visualizations. Your work demonstrates a solid understanding of data manipulation, visualization, and statistical hypothesis testing.

---

<div style="border-left: 7px solid green; padding: 10px;">
<b>✅ Strengths:</b>
<ul>
  <li><b>Data Loading and Preparation:</b> You efficiently loaded the datasets and performed necessary data cleaning, including converting data types and checking for duplicates. Your attention to detail in data preprocessing is commendable.</li>
  <li><b>Exploratory Data Analysis:</b> You conducted thorough EDA by identifying the top neighborhoods and taxi companies. Your visualizations effectively highlight key insights, such as the most popular drop-off locations and the taxi companies with the highest trip counts.</li>
  <li><b>Visualization Skills:</b> Your use of Seaborn and Matplotlib for visualizations is effective. The bar charts clearly display the distribution of trips, and the boxplot provides a good overview of ride durations across weather conditions.</li>
  <li><b>Hypothesis Testing:</b> You correctly formulated the null and alternative hypotheses and applied the two-sample t-test appropriately to compare ride durations between rainy and non-rainy Saturdays.</li>
  <li><b>Interpretation of Results:</b> You provided a clear interpretation of your statistical test results, explaining the implications for Zuber's operations.</li>
  <li><b>Overall Presentation:</b> Your code is structured logically, and your markdown explanations guide the reader through each step of your analysis effectively.</li>
</ul>
</div>

<div style="border-left: 7px solid gold; padding: 10px;">
<b>⚠️ Areas for Improvement:</b>
<ul>
  <li><b>Formatting and Consistency:</b> Ensuring consistent formatting throughout your notebook, such as uniform use of headings and consistent spacing, can enhance the overall presentation of your work.</li>
  <li><b>Additional Insights:</b> While your interpretations are insightful, including more detailed recommendations based on your findings could further strengthen your conclusions. For example, suggesting specific strategies Zuber could implement to address longer ride durations on rainy Saturdays.</li>
</ul>
</div>

<div style="border-left: 7px solid red; padding: 10px;">
<b>⛔️ Critical Changes Required:</b>
<ul>
    <li>None ;)</li>
</ul>
</div>

---

**Conclusion**

Your project is thorough and well-executed, demonstrating strong analytical skills and attention to detail. You have effectively addressed the critical issue from the previous review by correctly applying the two-sample t-test and providing a detailed interpretation of your findings. Your analysis provides valuable insights that can help Zuber make informed business decisions.

Great job! Your project meets all the requirements, and we are happy to approve it.

**Next Steps**

- **Enhance Recommendations:**
  - Based on your findings, provide more specific recommendations for Zuber. For instance, suggest increasing the number of drivers on rainy Saturdays or implementing strategies to reduce ride durations during bad weather.

- **Improve Formatting Consistency:**
  - Ensure uniform use of heading levels and consistent spacing between sections to enhance the professional appearance of your notebook.

- **Explore Additional Analyses:**
  - Consider exploring other factors that might influence ride durations, such as time of day or specific locations within the top neighborhoods, to provide a more comprehensive analysis.

If you have any questions or need further assistance in future projects, please feel free to reach out. Keep up the excellent work!

Best regards,

Matías

</div>

# Importing

In [None]:
# importing libraries

import pandas as pd
import math
from scipy import stats as st
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# reading in datasets 

rides_per_company = pd.read_csv('/datasets/project_sql_result_01.csv')

trips_per_location = pd.read_csv('/datasets/project_sql_result_04.csv')

loop_ohare = pd.read_csv('/datasets/project_sql_result_07.csv')

In [None]:
# inspecting dataframes

rides_per_company.info()

rides_per_company.head()

In [None]:
trips_per_location.info()

trips_per_location.head()

In [None]:

loop_ohare.info()

loop_ohare.head()

Observations

Data types look fine, aside from start_ts in loop_ohare being an object -- we'll want to turn that into datetime in case we need such calculations in the future.

# Cleaning Data

In [None]:
loop_ohare['start_ts'] = pd.to_datetime(loop_ohare['start_ts'], format='%Y-%m-%d %H:%M:%S')

loop_ohare.info()
loop_ohare.head()

# Exploratory Data Analysis

Identifying the top 10 neighborhoods by drop-offs

In [None]:
# sorting our trips_per_location df and looking at the top 10 neighborhoods

trips_per_location = trips_per_location.sort_values(by='average_trips', ascending=False)

trips_per_location.head(10)

In [None]:

# plotting a bar chart of the top ten dropoff locations

top_locations = trips_per_location.head(10)

sns.catplot(kind='bar',
            data=top_locations,
            x='dropoff_location_name',
            y='average_trips',
            height=5,
            aspect=2
           )

plt.title('Average Dropoffs Per Location')
plt.xlabel('Location')
plt.ylabel('Average Trips')
plt.xticks(rotation = 25)
plt.show()

Observations

We can clearly see that Loop is the most common dropoff location. The interesting part is how much of a lead the top 4 locations have over the rest of the top 10.

Loop, River North, Streeterville and West Loop are all significantly more popular than the rest of the locations, which taper off to roughly similar amounts.

Without having intimate knowledge of the Chicago area, we can tentatively conclude that these 4 locations are major destinations for travelers, given how frequently rides end there.
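To put a number on how far the leading locations are ahead, one option is to compute their share of all drop-offs. The sketch below is only an illustration: the location names and values are made up, and `top_n_share` is a hypothetical helper, not part of the project.

```python
import pandas as pd

# Made-up values for illustration only; these are not the project data
trips = pd.DataFrame({
    'dropoff_location_name': ['A', 'B', 'C', 'D', 'E'],
    'average_trips': [100.0, 80.0, 60.0, 10.0, 5.0],
})

def top_n_share(df, n, value_col='average_trips'):
    """Return the fraction of the column total held by the n largest rows."""
    top_total = df.nlargest(n, value_col)[value_col].sum()
    return top_total / df[value_col].sum()

share = top_n_share(trips, 3)
print(f'Top 3 share of all drop-offs: {share:.1%}')
```

Applied to `trips_per_location` with `n=4`, the same helper would show what fraction of average drop-offs the four leading neighborhoods account for.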

Identifying top taxi companies

In [None]:
# sorted out rides_per_company df to identify the top 10 taxi companies by number of rides

rides_per_company = rides_per_company.sort_values(by='trips_amount', ascending=False)

rides_per_company.head(10)

In [None]:

# plotting this, similar to above

top_companies = rides_per_company.head(10)

sns.catplot(kind='bar',
            data=top_companies,
            x='company_name',
            y='trips_amount',
            height=5,
            aspect=2
           )

plt.title('Trips per Taxi Company')
plt.xlabel('Company')
plt.ylabel('Number of Trips')
plt.xticks(rotation = 80)
plt.show()

Observations

Flash Cab is wildly more popular than all the rest of the taxi companies -- nearly double the trips! The other 9 are all much more similar, with a gradual decline between number 2 and number 10. This tells us just how much bigger and more popular Flash Cab is than all the rest.
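A claim like "nearly double" can be checked directly by dividing the leader's trip count by the runner-up's. This is a minimal sketch with made-up company names and counts; only the column names match `rides_per_company`.

```python
import pandas as pd

# Made-up values for illustration only; these are not the project data
companies = pd.DataFrame({
    'company_name': ['X Cab', 'Y Cab', 'Z Cab'],
    'trips_amount': [900, 500, 450],
})

# Rank companies from most to fewest trips
ranked = (companies
          .sort_values(by='trips_amount', ascending=False)
          .reset_index(drop=True))

# Each company's trips as a fraction of the leader's trips
ranked['ratio_to_leader'] = ranked['trips_amount'] / ranked['trips_amount'].iloc[0]

# How many times more trips the leader has than the runner-up
lead_ratio = ranked['trips_amount'].iloc[0] / ranked['trips_amount'].iloc[1]

print(ranked)
print('Leader vs. runner-up:', round(lead_ratio, 2))
```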

# Testing Hypotheses

We will be testing the following hypothesis, using the data we generated from the database, stored in our loop_ohare dataframe:

"The average duration of rides from the Loop to O'Hare International Airport changes on rainy Saturdays."

Null Hypothesis: The average duration of rides from the Loop to O'Hare International Airport does not change on rainy Saturdays.

Alternative Hypothesis: The average duration of rides from the Loop to O'Hare International Airport changes on rainy Saturdays.

In [None]:
# creating a day_of_week column to confirm that every ride falls on a Saturday (dayofweek == 5)
loop_ohare["day_of_week"] = loop_ohare["start_ts"].dt.dayofweek

print(loop_ohare["day_of_week"].value_counts())

In [None]:
# splitting our loop_ohare dataframe into two samples, for good and rainy weather

good_saturdays = loop_ohare.query('weather_conditions == "Good"')

bad_saturdays = loop_ohare.query('weather_conditions == "Bad"')

display(good_saturdays.head())
display(bad_saturdays.head())
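The split above relies on the export containing only Saturday rides. If that were not guaranteed, the Saturday filter could be applied explicitly before splitting. A minimal sketch on a tiny made-up sample (the rows are invented; only the column names mirror `loop_ohare`):

```python
import pandas as pd

# Tiny made-up sample; not the project data
sample = pd.DataFrame({
    'start_ts': pd.to_datetime(['2017-11-04 10:00:00',    # Saturday
                                '2017-11-05 10:00:00',    # Sunday
                                '2017-11-11 16:00:00']),  # Saturday
    'weather_conditions': ['Good', 'Bad', 'Bad'],
    'duration_seconds': [1500.0, 1800.0, 2400.0],
})

# pandas numbers weekdays Monday=0 ... Sunday=6, so Saturday is 5
saturdays = sample[sample['start_ts'].dt.dayofweek == 5]

# Split the Saturday-only rows by weather
good = saturdays.query('weather_conditions == "Good"')
bad = saturdays.query('weather_conditions == "Bad"')

print(len(saturdays), len(good), len(bad))
```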

In [None]:
# testing our hypothesis

alpha = 0.05 

results = st.ttest_ind(good_saturdays['duration_seconds'], bad_saturdays['duration_seconds'], equal_var=False)

print('p-value:', results.pvalue)

if (results.pvalue < alpha):
    print("We reject the null hypothesis")
else:
    print("We can't reject the null hypothesis")

Observations

After running our test, we have sufficient evidence to reject the null hypothesis, meaning that there is a statistically significant difference between the average duration of rides from Loop to O'Hare on rainy Saturdays vs good weather Saturdays.

We do not yet know the direction or size of the difference, but we can find out by comparing the two group means.

Regarding the test, I chose an alpha value of 0.05 because it is the conventionally accepted significance level for hypothesis testing. I used st.ttest_ind because we are comparing the means of two independent samples, and I set equal_var=False (Welch's t-test) because we cannot assume that the two samples have equal variances.
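One way to back up the equal-variance choice is to compare the spread of the two samples with Levene's test (`scipy.stats.levene`). The sketch below uses synthetic durations, not the project data, and the 0.05 cut-off for the variance check is an assumption. Defaulting to Welch's test without a pre-test, as done above, is also a common and defensible choice.

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(0)

# Synthetic ride durations in seconds; not the project data
good = rng.normal(loc=2000, scale=700, size=300)
bad = rng.normal(loc=2400, scale=300, size=100)

# Levene's test: the null hypothesis is that both samples have equal variances
levene = st.levene(good, bad)
equal_var = bool(levene.pvalue >= 0.05)

# Welch's t-test when variances differ, Student's t-test otherwise
result = st.ttest_ind(good, bad, equal_var=equal_var)

print('Levene p-value:', levene.pvalue)
print('equal_var used:', equal_var)
print('t-test p-value:', result.pvalue)
```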

In [None]:
# looking at the average duration of good weather rides and bad weather rides

good_average = good_saturdays['duration_seconds'].mean()
bad_average = bad_saturdays['duration_seconds'].mean()

print('Average ride duration on good weather Saturdays:', good_average)
print('Average ride duration on bad weather Saturdays:', bad_average)

In [None]:
# plotting this to understand the magnitude of the difference
loop_ohare_dist = loop_ohare[['weather_conditions', 'duration_seconds']]

sns.catplot(kind='box',
            data=loop_ohare_dist,
            x='duration_seconds',
            y='weather_conditions',
            height=5,
            aspect=3
           )

plt.title('Ride Duration According to Weather Conditions')
plt.xlabel('Ride Duration in Seconds')
plt.ylabel('Weather Conditions')
plt.show()

Observations

Looking at both our mean data and the box plot, we can confidently conclude that the average ride duration is longer on rainy Saturdays.
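The two-sided test only establishes that the means differ. To support the directional claim that rainy rides are longer, a one-sided Welch test could be run alongside the difference in means. This is a sketch on synthetic durations rather than the project data, and it assumes SciPy 1.6 or newer for the `alternative` parameter of `ttest_ind`.

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(1)

# Synthetic ride durations in seconds; not the project data
good = rng.normal(loc=2000, scale=700, size=300)
bad = rng.normal(loc=2400, scale=700, size=100)

# Size of the difference in seconds and as a percentage of the good-weather mean
diff_seconds = bad.mean() - good.mean()
diff_percent = diff_seconds / good.mean() * 100

# One-sided Welch test; the alternative hypothesis is mean(bad) > mean(good)
one_sided = st.ttest_ind(bad, good, equal_var=False, alternative='greater')

print(f'Difference: {diff_seconds:.0f} s ({diff_percent:.1f}%)')
print('One-sided p-value:', one_sided.pvalue)
```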

# Conclusion

Throughout this project, I worked through the full process of scraping the data from a website, querying it once it was stored in an organized SQL database, and analyzing the database exports using Python.

I pulled several useful data slices using SQL so that we could analyze them further in Python.

I identified the top 10 taxi companies, with Flash Cab leading by a landslide. It is significantly more popular than any other taxi company.

I identified the most popular dropoff locations, with the top 4 of those locations, Loop, River North, Streeterville and West Loop, all being significantly more popular than the rest of the top 10. We concluded earlier that those locations are likely important to the average traveler, since so many people end up there.

Finally, I tested the hypothesis that the average duration of rides from the Loop to O'Hare International Airport changes on rainy Saturdays.

I created the following Null and Alternative hypotheses:

Null Hypothesis: The average duration of rides from the Loop to O'Hare International Airport does not change on rainy Saturdays.

Alternative Hypothesis: The average duration of rides from the Loop to O'Hare International Airport changes on rainy Saturdays.

The results of the test allowed me to reject the null hypothesis. I then dug deeper into the data to understand the difference: comparing the group means and the box plot, we can confidently conclude that the average ride duration is longer on rainy Saturdays.