<a href="https://colab.research.google.com/github/sadansabo/AI/blob/main/Analyzing%20New%20York%20Taxi%20Data%20Using%20PySpark%20.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Analyzing New York Taxi Data Using PySpark on Azure (Free Tier)**

#### **Introduction**
You are a data engineer at a transportation analytics company. Your team has been tasked with analyzing the New York City taxi dataset to uncover insights that can help improve taxi services, optimize routes, and understand customer behavior. The dataset is large, so you decide to use PySpark on Azure Databricks to process it efficiently.

---

### **Step 1: Setting Up the Environment**
**Story:**  
You start by setting up your cloud environment. Azure Databricks is chosen because it provides a collaborative, scalable, and managed platform for big data analytics with PySpark. You create a Databricks workspace and a small cluster, and configure it to read data from an Azure Blob Storage account where the dataset is stored.

**Mini Tasks:**
1. Create a free Azure account if you don’t already have one.
2. Set up an Azure Blob Storage account and upload a subset of the New York taxi dataset (e.g., `trip_data.csv` and `trip_fare.csv`).
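3. Create an Azure Databricks workspace and launch a small cluster (PySpark comes preinstalled with the Databricks runtime).
4. Mount the Blob Storage container to your Databricks workspace using a `wasbs://` URL (or `abfss://` if the storage account has Data Lake Storage Gen2 enabled), authenticating with an account key or a SAS token.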
5. Verify that PySpark is working by running a simple script to print "Hello, PySpark!"

---

### **Step 2: Loading and Exploring the Data**
**Story:**  
With the environment ready, you load the dataset into PySpark DataFrames. The dataset contains information about taxi trips, including pickup/dropoff locations, timestamps, trip distances, fares, and payment types. You begin by exploring the data to understand its structure and identify any quality issues.

**Mini Tasks:**
1. Load a subset of the `trip_data.csv` and `trip_fare.csv` files from Blob Storage into PySpark DataFrames.
2. Print the schema of both DataFrames to understand the column names and data types.
3. Check for missing values and duplicates in the dataset.
4. Display the first 5 rows of each DataFrame to get a sense of the data.

---

### **Step 3: Data Cleaning and Transformation**
**Story:**  
You notice some inconsistencies in the data, such as missing values in the `passenger_count` column and incorrect timestamps. You decide to clean the data by removing invalid records and transforming columns into the correct formats.

**Mini Tasks:**
1. Drop rows with missing values in critical columns like `pickup_datetime`, `dropoff_datetime`, and `trip_distance`.
2. Convert the `pickup_datetime` and `dropoff_datetime` columns to timestamp format.
3. Filter out rows where `trip_distance` is 0 or negative.
4. Create a new column `trip_duration` by subtracting `pickup_datetime` from `dropoff_datetime` (for example, in seconds).

---

### **Step 4: Joining Datasets**
**Story:**  
To get a complete picture of each trip, you join the `trip_data` and `trip_fare` DataFrames on the columns they share, such as `medallion`, `hack_license`, and `pickup_datetime`. A single cab and driver make many trips, so `medallion` and `hack_license` alone do not identify one trip. The join lets you analyze trip details and fare information together.

**Mini Tasks:**
1. Identify the common columns between `trip_data` and `trip_fare` DataFrames.
2. Perform an inner join on the two DataFrames using the common columns.
3. Verify the join by checking the row count and inspecting the resulting DataFrame.

---

### **Step 5: Analyzing the Data**
**Story:**  
With the cleaned and joined dataset, you start analyzing it to uncover insights. You focus on answering key questions, such as:
- What is the average trip duration and fare?
- Which payment methods are most popular?
- What are the peak hours for taxi rides?

**Mini Tasks:**
1. Calculate the average `trip_duration` and `fare_amount`.
2. Group the data by `payment_type` and count the number of trips for each payment method.
3. Extract the hour from `pickup_datetime` and create a histogram to visualize peak hours.
4. Identify the top 5 pickup and dropoff locations based on trip frequency.

---

### **Step 6: Visualizing the Results**
**Story:**  
To make your findings more accessible, you decide to visualize the results using Databricks' built-in visualization tools or a Python library like Matplotlib or Seaborn. You export the aggregated data from PySpark to a Pandas DataFrame for easier plotting.

**Mini Tasks:**
1. Export the results of your analysis (e.g., average fare, peak hours) to a Pandas DataFrame.
2. Create a bar chart to show the distribution of payment types.
3. Plot a line chart to visualize the number of trips per hour (peak hours).
4. Generate a scatter plot to explore the relationship between `trip_distance` and `fare_amount`.

---

### **Step 7: Saving the Results**
**Story:**  
Finally, you save the cleaned dataset and analysis results back to Blob Storage for future use. You also document your findings in a report that will be shared with your team.

**Mini Tasks:**
1. Save the cleaned and joined DataFrame as a Parquet file in your Blob Storage account.
2. Export the aggregated results (e.g., peak hours, payment types) as a CSV file to Blob Storage.
3. Write a brief summary of your findings in a Markdown file and upload it to Blob Storage.

---

### **Conclusion**
**Story:**  
You’ve successfully built a PySpark pipeline to analyze the New York taxi dataset on Azure. Your analysis revealed valuable insights, such as peak hours, popular payment methods, and average trip durations. These findings will help your team make data-driven decisions to improve taxi services in New York City.

**Mini Tasks:**
1. Terminate the Databricks cluster to avoid unnecessary costs.
2. Share the Blob Storage links to the cleaned data, analysis results, and report with your team.

---

### **Bonus Challenge**
- Use PySpark MLlib to build a simple machine learning model that predicts taxi fares based on features like `trip_distance` and `trip_duration`.
- Explore geospatial analysis by visualizing pickup and dropoff locations on a map using libraries like Folium or Databricks' built-in mapping tools.

---

### **Tips for Staying Within Free Tier Limits**
1. Use a **small cluster** in Databricks and terminate it immediately after completing tasks.
2. Work with a **subset of the dataset** (e.g., 1-2 months of data) to reduce processing time and resource usage.
3. Use **compact storage formats**: save results as Parquet, or GZIP-compress CSV files, to reduce storage space in Blob Storage.
4. Monitor your usage in the Azure portal to ensure you stay within the free tier limits.

---

This project provides hands-on experience with PySpark, Azure Databricks, and data analysis.