# Key Statistical Concepts


### Mean
The mean is the average of a dataset and is calculated by summing all the values and dividing by the total number of observations. It provides a central value of the data.

- Example:

In a data analytics scenario, suppose we have the following sales data for a week (in thousands of dollars): [10, 15, 20, 25, 30]. The mean, or average, is 100 / 5 = 20.

```python
sales = [10, 15, 20, 25, 30]
mean_sales = sum(sales) / len(sales)
print(f"Mean Sales: {mean_sales}") # Output: Mean Sales: 20.0
```
### Median
The median is the middle value of a dataset when sorted in ascending order. It is useful in datasets that may contain outliers.

- Example:

Consider customer ages: [22, 25, 29, 50, 35]. Sorted in ascending order they are [22, 25, 29, 35, 50], so the median age, the value in the middle, is 29.

```python
ages = [22, 25, 29, 50, 35]
sorted_ages = sorted(ages)
n = len(sorted_ages)
median_age = sorted_ages[n // 2] if n % 2 != 0 else (sorted_ages[n // 2 - 1] + sorted_ages[n // 2]) / 2
print(f"Median Age: {median_age}") # Output: Median Age: 29
```
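To see why the median is preferred when outliers are present, the sketch below compares both measures before and after adding one extreme value. The income figures are hypothetical and chosen only for illustration; `mean` and `median` come from Python's standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical monthly incomes (in thousands of dollars)
incomes = [30, 32, 35, 38, 40]
# The same data with one extreme value appended
incomes_with_outlier = incomes + [500]

mean_before, median_before = mean(incomes), median(incomes)
mean_after, median_after = mean(incomes_with_outlier), median(incomes_with_outlier)

print(f"Without outlier: mean={mean_before}, median={median_before}")
print(f"With outlier:    mean={mean_after}, median={median_after}")
```

The single outlier pulls the mean from 35 to 112.5, while the median only moves from 35 to 36.5.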

### Mode
The mode is the value that appears most frequently in a dataset. It is particularly useful for categorical data.

- Example:

In a survey about favorite fruits, the responses are ["Apple", "Banana", "Apple", "Orange", "Banana", "Banana"]. The mode, the most frequent value, is "Banana".

```python
from statistics import mode

fruits = ["Apple", "Banana", "Apple", "Orange", "Banana", "Banana"]
mode_fruit = mode(fruits)
print(f"Mode of Fruits: {mode_fruit}") # Output: Mode of Fruits: Banana
```
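A dataset can have more than one mode. Since Python 3.8, `statistics.mode` returns the first mode it encounters when there is a tie, so `statistics.multimode` is a handy way to see all of them. The survey responses below are hypothetical.

```python
from statistics import multimode

# Hypothetical survey in which two answers tie for first place
responses = ["Apple", "Banana", "Apple", "Banana", "Orange"]

# multimode returns every most-frequent value, in order of first appearance
modes = multimode(responses)
print(f"Modes: {modes}")
```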
### Variance
Variance measures how far each number in the dataset is from the mean and indicates the degree of spread in the data.

- Example:

In analyzing test scores: [80,85,90,95,100]

Mean =
$\frac{80 + 85 + 90 + 95 + 100}{5}$ = $\frac{450}{5}$ = 90

Calculate the Squared Differences from the Mean:

- $(80-90)^2 = 100$
- $(85-90)^2 = 25$
- $(90-90)^2 = 0$
- $(95-90)^2 = 25$
- $(100-90)^2 = 100$

Sum the Squared Differences:
100+25+0+25+100=250

Calculate the Variance (using sample variance, which divides by $n−1$):

Variance = $\frac{250}{5 - 1}$ = $\frac{250}{4}$ = 62.5


```python
scores = [80, 85, 90, 95, 100]
mean_score = sum(scores) / len(scores)
variance = sum((x - mean_score) ** 2 for x in scores) / (len(scores) - 1)
print(f"Variance of Scores: {variance}") # Output: Variance of Scores: 62.5
```
### Standard Deviation
The standard deviation is the square root of the variance. It measures the typical distance of the data points from the mean and is expressed in the same units as the data.

- Example:

Using the same test scores, [80,85,90,95,100], the standard deviation is $\sqrt{62.5} \approx 7.9$

```python
import math

std_dev = math.sqrt(variance)
print(f"Standard Deviation of Scores: {std_dev}") # Output: Standard Deviation of Scores: 7.905694150420948
```
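The manual calculations above can be cross-checked with the standard `statistics` module. This sketch also shows the population variance (dividing by $n$ instead of $n-1$), which is not used in the example above and is included only for comparison.

```python
from statistics import pvariance, stdev, variance

scores = [80, 85, 90, 95, 100]

sample_var = variance(scores)       # sample variance: divides by n - 1
population_var = pvariance(scores)  # population variance: divides by n
sample_std = stdev(scores)          # square root of the sample variance

print(f"Sample variance: {sample_var}")
print(f"Population variance: {population_var}")
print(f"Sample standard deviation: {sample_std:.2f}")
```

Use the sample versions when the data is a sample drawn from a larger population, and the population versions when the data covers every member of the group you care about.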
### Quantiles
Quantiles are cut points that divide a sorted dataset into groups containing equal shares of the observations. For instance, quartiles divide the data into four equal parts.
Quantiles are useful for identifying outliers in the data.

- Example:

Consider the following exam scores:
[60,70,80,90,100]

```python
scores = [60, 70, 80, 90, 100]
sorted_scores = sorted(scores)

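# Approximate the quartiles by picking the value at a fractional position.
# This simple index-based method does not interpolate between values.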
q1 = sorted_scores[int(len(sorted_scores) * 0.25)]
q2 = sorted_scores[int(len(sorted_scores) * 0.5)]  # Median
q3 = sorted_scores[int(len(sorted_scores) * 0.75)]

print(f"1st Quartile (Q1):{q1}") # Output: 1st Quartile (Q1):70
print(f"Median (Q2):{q2}") # Output: Median (Q2):80
print(f"3rd Quartile (Q3):{q3}") # Output: 3rd Quartile (Q3):90
```
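As a cross-check, the standard library's `statistics.quantiles` computes quartiles with interpolation, and the quartiles can then be used to flag outliers. The sketch below assumes the `method="inclusive"` variant and the common 1.5 × IQR rule of thumb; the extra score of 200 is hypothetical and added only to trigger the rule.

```python
from statistics import quantiles

scores = [60, 70, 80, 90, 100]

# method="inclusive" treats the smallest and largest values as the
# 0th and 100th percentiles of the data
q1, q2, q3 = quantiles(scores, n=4, method="inclusive")

# Interquartile range and the usual 1.5 * IQR fences
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Check the original scores plus one hypothetical extreme score
candidates = scores + [200]
outliers = [x for x in candidates if x < lower_fence or x > upper_fence]

print(f"Q1={q1}, Q2={q2}, Q3={q3}, IQR={iqr}")
print(f"Outliers: {outliers}")
```

Different quantile methods can give slightly different values on small datasets, so results may not match across libraries unless the same method is selected.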

In [None]:
# use this as a playground to experiment with statistical concepts

sales = [10, 15, 20, 25, 30]
mean_sales = sum(sales) / len(sales)
print(f"Mean Sales:{mean_sales}")

ages = [22, 25, 29, 50, 35]
sorted_ages = sorted(ages)
n = len(sorted_ages)
median_age = sorted_ages[n // 2] if n % 2 != 0 else (sorted_ages[n // 2 - 1] + sorted_ages[n // 2]) / 2
print(f"Median Age: {median_age}")

from statistics import mode
fruits = ["Apple", "Banana", "Apple", "Orange", "Banana", "Banana"]
mode_fruit = mode(fruits)
print(f"Mode of Fruits: {mode_fruit}")

scores = [80, 85, 90, 95, 100]
mean_score = sum(scores) / len(scores)
variance = sum((x - mean_score) ** 2 for x in scores) / (len(scores) - 1)
print(f"Variance of Scores: {variance}")

import math
std_dev = math.sqrt(variance)
print(f"Standard Deviation of Scores: {std_dev}")

scores = [60, 70, 80, 90, 100]
sorted_scores = sorted(scores)

# Calculate quartiles
q1 = sorted_scores[int(len(sorted_scores) * 0.25)]
q2 = sorted_scores[int(len(sorted_scores) * 0.5)]  # Median
q3 = sorted_scores[int(len(sorted_scores) * 0.75)]

print(f"1st Quartile (Q1):{q1}")
print(f"Median (Q2):{q2}")
print(f"3rd Quartile (Q3):{q3}")

# Dataframes

In Python, dataframes are widely used data structures that represent data in a tabular form—organized in rows and columns, much like a spreadsheet or a SQL table. Dataframes make data analysis, manipulation, and visualization much easier, especially when working with large datasets.

**What is a DataFrame?**

A DataFrame is essentially a two-dimensional, labeled data structure with columns of potentially different types. This versatility allows users to work with heterogeneous data types within a single structure, making dataframes very powerful for data science and machine learning.

**Key characteristics of a DataFrame:**

- Two-dimensional: It has rows and columns, which gives it a tabular structure.
- Labeled axes: Rows and columns can be labeled, making it easy to reference parts of the DataFrame by name rather than by position.
- Flexible data types: Each column can hold a different data type (e.g., integers, floats, strings, dates).
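These characteristics can be seen in a small pandas DataFrame. The sketch below uses hypothetical data and row labels (`df_people`, `"a"`, `"b"`, `"c"`) chosen only for illustration.

```python
import pandas as pd

# Hypothetical data: each column holds a different data type
df_people = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Cleo"],      # strings
        "age": [28, 35, 42],                 # integers
        "height_m": [1.65, 1.80, 1.72],      # floats
    },
    index=["a", "b", "c"],  # row labels
)

print(df_people.shape)   # two-dimensional: (rows, columns)
print(df_people.dtypes)  # one data type per column

# Labeled axes: look up a value by row label and column name
age_of_b = df_people.loc["b", "age"]
print(age_of_b)
```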




## Types of DataFrames in Python
While pandas is the most widely used library for dataframes in Python, other libraries implement dataframes with additional functionality or optimizations tailored for specific use cases.

Here are some of the primary types of dataframes in Python:

| **DataFrame Type** | **Best Use Case**                                                  | **Memory Usage**                           | **Description**                                                                                 |
|--------------------|--------------------------------------------------------------------|--------------------------------------------|-------------------------------------------------------------------------------------------------|
| **[Pandas](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)**         | Small to medium datasets that fit in memory                       | In-memory                                  | Standard Python library for data manipulation and analysis.                                     |
| **[Dask](https://docs.dask.org/en/stable/dataframe.html)**           | Large datasets that exceed memory limits, single-machine parallel | Out-of-core (chunk-based, parallel)        | Extension of pandas for handling large data, processing in parallel.                            |
| **[Koalas/PySpark](https://koalas.readthedocs.io/en/latest/reference/frame.html#constructor)** | Big data processing on distributed cluster with Spark             | Distributed (cluster-based)                | Spark-compatible dataframe for big data, scalable to clusters.                                  |
| **[Modin](https://modin.readthedocs.io/en/latest/usage_guide/index.html)**          | Pandas-compatible workflows with faster, parallel execution       | In-memory (parallel)                       | Parallelized pandas replacement for faster execution.                                           |
| **[Polars](https://docs.pola.rs/)**         | Performance-critical applications on large datasets               | In-memory (optimized for speed)            | Rust-based dataframe optimized for speed, available in Python.                                  |
| **[Vaex](https://vaex.io/docs/api.html)**           | Extremely large datasets without loading all data into memory     | Out-of-core (efficient memory usage)       | Handles large datasets out-of-core; efficient for exploration and stats.                        |



<div style="display: flex; justify-content: space-around;">
    <img src="https://pandas.pydata.org/static/img/pandas.svg" alt="Pandas Logo" style="height: 50px; max-height: 50px;">
    <img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTP5KVvipePSkKXNx0CLGxNfV2hnEdm13gPMA&s" alt="Dask Logo" style="height: 50px; max-height: 50px;">
    <img src="https://koalas.readthedocs.io/en/latest/_static/koalas-logo-docs.png" alt="Koalas Logo" style="height: 50px; max-height: 50px;">
    <img src="https://modin.readthedocs.io/en/latest/_images/MODIN_ver2_hrz.png" alt="Modin Logo" style="height: 50px; max-height: 50px;">
    <img src="https://raw.githubusercontent.com/pola-rs/polars-static/master/logos/polars-logo-dimmed-medium.png" alt="Polars Logo" style="height: 50px; max-height: 50px;">
    <img src="https://vaex.io/docs/_static/logo-grey.svg" alt="Vaex Logo" style="height: 50px; max-height: 50px;">
</div>





## Pandas

Pandas is a powerful and widely-used library in Python for data manipulation and analysis. It provides flexible and efficient data structures, primarily
**Series** (1-dimensional) and **DataFrame** (2-dimensional), that make it easy to work with structured data. The DataFrame is especially valuable for data analytics as it allows you to store, filter, transform, and analyze datasets much like a spreadsheet or SQL table.

- Data Loading and Cleaning: Pandas can load data from various sources (CSV, Excel, SQL databases, etc.) and provides tools for handling missing values, duplicates, and other inconsistencies.
- Data Transformation: You can easily filter rows, select columns, group data, and apply complex transformations.
- Data Aggregation and Summary Statistics: Pandas provides efficient ways to compute summary statistics (mean, sum, median, etc.) for each group or column.
- Data Visualization Integration: Pandas integrates well with libraries like Matplotlib and Seaborn for quick visualizations.

Let’s go through some examples that demonstrate common tasks in data analytics using Pandas.

> For more details on specific functions, [read the docs](https://pandas.pydata.org/docs/reference/index.html).


### Installing pandas

Installing pandas depends on the environment you're using.

1. In a notebook environment such as Google Colab, if you need to install pandas or update it to a specific version, you can do the following:

  ```python
  # Install pandas or update it to the latest version
  !pip install pandas --upgrade
  ```
  Run this in a Colab cell, and it will install or update pandas to the latest version.

  You can install a specific version of pandas using the `==` syntax.
  ```python
  # Install a specific version of pandas
  !pip install pandas==1.5.3
  ```
> **Google Colab** comes with pandas pre-installed, so you typically don't need to install it manually.

2. Installing Pandas in Jupyter Notebook or Local Python Environment
If you’re working locally in a Jupyter Notebook or any Python environment (like PyCharm, VS Code, etc.), you can use `pip` to install pandas.

  Open a terminal and run:

  ```bash
  pip install pandas
  ```
  If you want a specific version of pandas, specify it as follows:

  ```bash
  pip install pandas==1.5.3
  ```

3. Installing Pandas in Anaconda
If you’re using Anaconda, it’s often best to install pandas via the `conda` package manager, which manages dependencies more effectively within the Anaconda ecosystem.

  Using Conda
  Open the Anaconda Prompt and enter:

  ```bash
  conda install pandas
  ```
  This will install pandas and any required dependencies.

  If you want a specific version of pandas, you can specify it like this:

  ```bash
  conda install pandas=1.5.3
  ```
  Creating a New Environment with Pandas
  You can also create a new conda environment with pandas pre-installed:

  ```bash
  conda create -n myenv pandas
  ```
  Replace `myenv` with your desired environment name. To activate the environment, use:

  ```bash
  conda activate myenv
  ```

4. Verifying the Installation
To check that pandas installed correctly, open a Python environment (Colab, Jupyter, or terminal) and run:

  ```python
  import pandas as pd
  print(pd.__version__)  # This should display the installed pandas version
  ```
  This will confirm that pandas is installed and display the version.

In [None]:
import pandas as pd
print(pd.__version__)  # This should display the installed pandas version

### Importing the Library and Loading Data
First, let's import pandas and load a sample dataset. Here, we’ll use a sample CSV file, which could represent any real-world dataset.

```python
import pandas as pd

# Load data from a CSV file
# In real cases, you would use a file path, e.g., 'data.csv'
df = pd.read_csv("sample_data.csv")  # Replace with the path to your CSV file

# Display the first few rows of the dataset
df.head()
```
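If you do not have a CSV file at hand, `pd.read_csv` also accepts a file-like object, so you can try it on an in-memory string. The CSV content below is hypothetical and stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content; the third row has a missing population value
csv_text = """city,population,area_km2
Springfield,120000,85.5
Shelbyville,95000,60.2
Ogdenville,,42.0
"""

# io.StringIO wraps the string so that read_csv can treat it like a file
df_sample = pd.read_csv(io.StringIO(csv_text))

print(df_sample.head())
print(df_sample.dtypes)
```

The empty field is read as a missing value (`NaN`), which is why the `population` column is loaded as a float rather than an integer.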




In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
import pandas as pd

# Load data from a CSV file
df = pd.read_csv("sample_data/california_housing_test.csv")

# Display the first few rows of the dataset
df.head()

### Understanding the Data

1. Viewing the Structure of the DataFrame
To get a quick overview of the DataFrame, use:

```python
df.info()
```
This will show you the number of entries, column names, non-null counts, and data types.

For summary statistics of the numerical columns, use:

```python
df.describe()
```
The output is another DataFrame that includes several key statistics for each numerical column:

- count: The number of non-null entries.
- mean: The average value.
- std: The standard deviation, which measures the amount of variation or dispersion in the data.
- min: The minimum value.
- 25%: The first quartile, or the 25th percentile.
- 50%: The median, or the 50th percentile.
- 75%: The third quartile, or the 75th percentile.
- max: The maximum value.

2. Listing Columns
To list all the columns in the DataFrame:

```python
df.columns
```
This is useful if we need to iterate through the columns of the DataFrame.

3. Finding Minimum and Maximum Values
To find the minimum and maximum values of a numerical column (here `column` is a placeholder for the column name):

```python
min_values = df[column].min()
max_values = df[column].max()
```
> You can calculate other values like standard deviation, quantile, mean, mode and more using the same syntax.

4. Counting Unique Values
To get the number of unique values in a column (use `df.nunique()` to get it for every column at once):

```python
unique_counts = df[column].nunique()
print("Number of unique values:\n", unique_counts)
```
You can also get the unique values themselves using `unique_values = df[column].unique()`.

5. Counting Null Values
To count the number of null values in each column:

```python
null_counts = df.isnull().sum()
print("Number of null values:\n", null_counts)
```

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
df.columns



In [None]:
df.dtypes

In [None]:
# Create a dictionary to store results
results = {}

# Calculate metrics for numeric columns
numeric_df = df.select_dtypes(include='number')
for column in numeric_df.columns:
    results[column] = {
        'min': numeric_df[column].min(),
        'max': numeric_df[column].max(),
        'mean': numeric_df[column].mean(),
        'std': numeric_df[column].std(),
        'count': numeric_df[column].count(),
        'p90': numeric_df[column].quantile(0.90),
        'p75': numeric_df[column].quantile(0.75),
        'p50': numeric_df[column].quantile(0.5),
        'p25': numeric_df[column].quantile(0.25),
        'unique_count': numeric_df[column].nunique(),
        'most_common': numeric_df[column].mode()[0] if not numeric_df[column].mode().empty else None,
        'count_non_null': numeric_df[column].count(),
    }

# Calculate metrics for non-numeric columns
non_numeric_df = df.select_dtypes(exclude='number')
for column in non_numeric_df.columns:
    results[column] = {
        'unique_count': non_numeric_df[column].nunique(),
        'most_common': non_numeric_df[column].mode()[0] if not non_numeric_df[column].mode().empty else None,
        'count_non_null': non_numeric_df[column].count(),
    }

# Convert results to a DataFrame for better visualization
# the .T attribute is used to transpose the DataFrame
results_df = pd.DataFrame(results).T

results_df


In [None]:
# This will create a new interactive Google Sheet with the data in the DataFrame passed as a parameter (results_df)
from google.colab import sheets
sheet = sheets.InteractiveSheet(df=results_df)

In [None]:
unique_values = df['housing_median_age'].unique()
print(unique_values)

In [None]:
# seaborn library is a powerful visualization tool built on top of matplotlib.
# It provides a high-level interface for drawing attractive statistical graphics.
import seaborn as sns

# import the matplotlib.pyplot module and assign it to the alias 'plt'
import matplotlib.pyplot as plt

# Initialize a new figure for the plot: 10 inches width and 6 inches height.
plt.figure(figsize=(10, 6))
# Histogram with Density Plot
# sns.histplot(...) creates a histogram for the specified data.
# bins: determines how many intervals are used in the histogram.
# kde=True: enables the kernel density estimate (KDE) plot, a smoothed version
#   of the histogram that approximates the probability density of the variable.
# color='grey': sets the color of the histogram bars to grey.
# alpha=0.5: makes the bars semi-transparent (50% opacity), which helps when
#   elements overlap.
sns.histplot(df['housing_median_age'], bins=int(df['housing_median_age'].max()) - int(df['housing_median_age'].min()), kde=True, color='grey', alpha=0.5)

plt.title('Histogram with Density Plot')
plt.xlabel('Housing median age')
plt.ylabel('Frequency')
plt.grid(axis='y')
plt.show()


In [None]:
unique_counts = df.nunique()
print("Number of unique values:\n", unique_counts)



#### 🏆 Understanding the Data Challenge

**Descriptive Statistics Calculation of NYC Taxi Trips**

In this challenge, you will analyze a dataset to calculate descriptive statistics for both numeric and non-numeric columns.

The goal of this challenge is to create a dictionary that stores descriptive statistics for each column in the dataset.

This will include metrics for numeric columns (like min, max, mean, standard deviation, percentiles, and unique counts) and key statistics for non-numeric columns (like unique counts and the most common value).

Finally, you will convert this information into a DataFrame for easier visualization.

The data source will be the parquet file located at https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet


Steps to Accomplish the Challenge

- Load the Dataset:

  Ensure you have the dataset loaded into a DataFrame named df.

- Create a Results Dictionary:

  Initialize an empty dictionary to store the results of your calculations.

- Calculate Metrics for **Numeric** Columns:

  Use the `select_dtypes()` method to create a DataFrame containing only numeric columns.

  Iterate through the numeric columns and calculate the following metrics:
  - Minimum value
  - Maximum value
  - Mean
  - Standard deviation
  - Count of non-null values
  - Percentiles (25th, 50th, 75th)
  - Unique count
  - Most common value (mode)
  
- Calculate Metrics for **Non-Numeric** Columns:

  Create a DataFrame for non-numeric columns using select_dtypes().
  
  Iterate through the non-numeric columns and calculate:
  - Unique count
  - Most common value (mode)
  - Count of non-null values

- Convert Results to a DataFrame:

  Convert the results dictionary into a DataFrame for better visualization, using the `.T` attribute to transpose it.

- Display the Results:

  Print the resulting DataFrame to review the summary statistics for all columns.


### Data Cleaning
Real-world datasets often have missing or inconsistent data. Pandas makes it easy to identify and clean this data.

We can apply different strategies:

1. Handling Missing Values

  In case of a missing value we can fill it with an aggregated value like the mean of the column.

  - Initial Check:
  
    Start with `df.isnull().sum()` to get a quick overview of null values in the dataframe

  - Detailed Analysis:

    If you notice a significant number of missing values in certain columns, you might want to investigate those columns further. For example, you could check the data types and consider how to handle missing values based on the type of data (`mean` for numeric, `mode` for categorical, etc.).

  ```python
  # Check for missing values
  print(f"Number of null values:\n{df.isnull().sum()}")

  # Fill missing values
  for column in df.columns:
      if df[column].dtype in ['float64', 'int64']:  # Check if the column is numeric
          # Assign the result back: fillna(inplace=True) on df[column] may act
          # on a temporary copy and leave df unchanged in recent pandas versions
          df[column] = df[column].fillna(df[column].mean())  # Fill with mean
      elif df[column].dtype == 'object':  # Check if the column is categorical
          df[column] = df[column].fillna(df[column].mode()[0])  # Fill with mode

  # Verify that there are no more missing values
  print(f"Number of null values after filling:\n{df.isnull().sum()}")
  ```

2. Dropping Duplicates

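  Duplicate rows can skew counts and averages, so they are usually removed. Rows with missing values can also be deleted entirely, instead of filled, using `df.dropna()`.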
  ```python
  # Drop duplicate rows, if any
  df.drop_duplicates(inplace=True)
  ```
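The strategies above can be compared on a tiny DataFrame. This is a minimal sketch with hypothetical data (`df_demo`, with a numeric `fare` column and a categorical `payment` column), not the dataset used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical data: two missing values and one duplicated row
df_demo = pd.DataFrame(
    {
        "fare": [10.0, None, 14.0, 14.0, 12.0],
        "payment": ["card", "cash", "card", "card", None],
    }
)

# Strategy 1: fill missing values (mean for numeric, mode for categorical)
filled = df_demo.copy()
filled["fare"] = filled["fare"].fillna(filled["fare"].mean())
filled["payment"] = filled["payment"].fillna(filled["payment"].mode()[0])

# Strategy 2: remove rows that exactly repeat an earlier row
deduplicated = filled.drop_duplicates()

# Strategy 3: remove every row that contains a missing value
dropped = df_demo.dropna()

print(filled)
print(f"Rows after drop_duplicates: {len(deduplicated)}")
print(f"Rows after dropna: {len(dropped)}")
```

Filling keeps all rows but introduces estimated values, while dropping keeps only observed values at the cost of losing rows. Which trade-off is acceptable depends on how much data is missing.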


#### 🏆 Data Cleaning Challenge

**Data Cleaning of Taxi Trip Data**

In this challenge, you will work with a dataset of yellow taxi trip data to perform data cleaning tasks. Your objective is to handle missing values in the dataset appropriately, ensuring that it is ready for analysis.

The data source will be the parquet file located at https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet

Steps to Accomplish the Challenge

- Load the Dataset:

  Use the pandas library to read the parquet file from the provided URL and store it in a DataFrame named df_taxi.
  
- Check for Missing Values:

  Print the number of null values in each column of the DataFrame to understand the extent of missing data.

- Fill Missing Values:

  Iterate through each column in the DataFrame:
  - For numeric columns (of type `float64` or `int64`), fill missing values with the mean of that column.
  - For categorical columns (of type `object`), fill missing values with the mode (most frequent value) of that column.

- Verify Completion:

  Print the number of null values in each column again to confirm that all missing values have been successfully filled.



### Filtering and Selecting Data
Pandas makes it easy to filter data based on certain conditions and select specific columns.

1. Selecting Columns
  ```python
  # Select specific columns
  subset = df[['Column1', 'Column2']]
  ```

2. Filtering Rows Based on Conditions
  ```python
  # Filter rows where 'Column1' > 50
  filtered_data = df[df['Column1'] > 50]
  ```

  To apply multiple conditions when filtering a DataFrame in pandas, you should use the bitwise operators `&` (for AND) or `|` (for OR) instead of the `and` and `or` keywords. Additionally, each condition needs to be enclosed in parentheses, because `&` and `|` bind more tightly than comparison operators such as `>`.

  ```python
  # Filter rows where 'Column1' > 50 and at the same time 'Column2' > 0
  filtered_data = df[(df['Column1'] > 50) & (df['Column2']> 0)]
  ```
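The filtering and selection techniques above can be combined. The sketch below uses a hypothetical orders table (`df_orders`) and also shows `.loc`, which filters rows and selects columns in one step.

```python
import pandas as pd

# Hypothetical orders data
df_orders = pd.DataFrame(
    {
        "quantity": [1, 3, 5, 2, 4],
        "price": [20.0, 80.0, 15.0, 45.0, 60.0],
        "region": ["north", "south", "north", "east", "south"],
    }
)

# AND: both conditions must hold
both = df_orders[(df_orders["quantity"] >= 2) & (df_orders["price"] < 50)]

# OR: at least one condition must hold
either = df_orders[(df_orders["quantity"] >= 4) | (df_orders["region"] == "east")]

# .loc takes a row condition and a list of columns at the same time
selected = df_orders.loc[df_orders["price"] < 50, ["quantity", "price"]]

print(both)
print(either)
print(selected)
```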


#### 🏆 Filtering and Selecting Data Challenge

**Filtering Taxi Trip Data**

In this challenge, you will work with a dataset of yellow taxi trip data to filter out trips based on specific criteria related to passenger count and trip distance. Your objective is to create a new DataFrame that includes only valid taxi trips for further analysis.

The data source will be the parquet file located at https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet

Conditions:
- the number of passengers is between 1 and 5 (inclusive)
- the trip distance is greater than 0.

The final result dataframe should include only the following columns:
```
'passenger_count', 'trip_distance', 'fare_amount', 'tip_amount', 'total_amount', 'VendorID','tpep_pickup_datetime', 'tpep_dropoff_datetime'
```

This will help ensure that the data you analyze is relevant and meets specific criteria.

Steps to Accomplish the Challenge
- Load the Dataset:

  Ensure you have the yellow taxi trip dataset loaded into a DataFrame named df_taxi.

- Understand the Data:

  Familiarize yourself with the DataFrame's structure, including the columns and their data types, focusing on passenger_count and trip_distance.

- Filter the Data:

  Use boolean indexing to create a new DataFrame, filtered_data, that meets the following conditions:
  The passenger_count must be greater than 0 and less than or equal to 5.
  The trip_distance must be greater than 0.

- Filter Columns:

  Include only the columns that are relevant for the analysis.

- Verify the Filtered Data:

  Check the shape or the first few rows of the filtered_data DataFrame to ensure the filtering was applied correctly and that it contains only the desired trips.


Hint: the passenger condition can be written as `passenger_count > 0` and `passenger_count <= 5`.

### Grouping and Aggregating Data
Grouping data is essential in data analytics to calculate summary statistics for different categories.

We use `groupby` to group the DataFrame by the indicated column. Each unique value in that column defines a separate group. For example, if the column contains the values 1, 2, 3, etc., there will be one group per value, and each group consists of all the rows that share that value.

There are multiple aggregate functions that can be used:

- `sum()`: Calculates the sum of the values for each group.
```python
grouped_data = df.groupby('CategoryColumn').sum().reset_index()
```
- `count()`: Counts the number of non-null values for each group.
```python
grouped_data = df.groupby('CategoryColumn').count().reset_index()
```
- `min()`: Finds the minimum value for each group.
```python
grouped_data = df.groupby('CategoryColumn').min().reset_index()
```
- `max()`: Finds the maximum value for each group.
```python
grouped_data = df.groupby('CategoryColumn').max().reset_index()
```
- `std()`: Calculates the standard deviation of the values for each group.
```python
grouped_data = df.groupby('CategoryColumn').std().reset_index()
```
- `median()`: Calculates the median value for each group.
```python
grouped_data = df.groupby('CategoryColumn').median().reset_index()
```
- `first()`: Returns the first value in each group.
```python
import pandas as pd
# Sample DataFrame
data = {
    'Category': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Value': [10, 15, 10, 5, 20, 25],
    'Date': ['2024-01-01', '2024-01-02', '2024-02-01', '2024-02-03', '2024-03-02', '2024-03-01']
}
df = pd.DataFrame(data)
# Convert 'Date' to datetime
df['Date'] = pd.to_datetime(df['Date'])
# Sort the DataFrame by 'Category' and 'Date'
df_sorted = df.sort_values(by=['Category', 'Date'])
# Group by 'Category' and get the first entry for each group
first_entries = df_sorted.groupby('Category').first().reset_index()
print(first_entries)
```
- `agg()`:   The `agg()` function is used to specify the aggregation operations that will be applied to each group created by the `groupby()`. This allows you to calculate multiple statistics in a single step.

  Inside `agg()`, each aggregation is defined in the format `new_column_name=('original_column_name', 'function')`.
  This creates a new column named new_column_name in the resulting DataFrame.
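The named-aggregation format described above can be sketched as follows. The DataFrame `df_trips` and the output column names (`avg_fare`, `max_fare`, `trip_count`) are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical trip data
df_trips = pd.DataFrame(
    {
        "passenger_count": [1, 1, 2, 2, 2],
        "fare_amount": [10.0, 20.0, 12.0, 18.0, 30.0],
    }
)

# Each keyword argument has the form new_column_name=('original_column_name', 'function')
fare_stats = (
    df_trips.groupby("passenger_count")
    .agg(
        avg_fare=("fare_amount", "mean"),
        max_fare=("fare_amount", "max"),
        trip_count=("fare_amount", "count"),
    )
    .reset_index()
)

print(fare_stats)
```

Each row of `fare_stats` describes one group, with one column per named aggregation.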




> If you don’t add `reset_index()` after using `groupby()` and `agg()`, the grouping column (for example, `passenger_count`) becomes the index of the resulting DataFrame rather than a regular column. When you group by several columns, the result has a hierarchical index (MultiIndex) made up of those columns.

**Implications of Not Using reset_index()**

- Accessing Data: You will have to use the .loc[] or .iloc[] methods to access the data, which can be less intuitive than working with a standard DataFrame.

- DataFrame Appearance: The DataFrame will look different because the columns used to group by will be part of the index rather than a column, which may make it harder to interpret at a glance.

- Subsequent Operations: If you want to perform further operations that expect the grouping variable to be a column (like merging, filtering, or plotting), you may need to reset the index later on, which can add extra steps to your workflow.
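The difference is easy to see by comparing the results side by side. This sketch uses a hypothetical `df_sales` table; it also shows the `as_index=False` option of `groupby()`, which is an alternative to calling `reset_index()` afterwards.

```python
import pandas as pd

# Hypothetical sales data
df_sales = pd.DataFrame({"store": ["A", "A", "B"], "revenue": [100, 150, 80]})

# Without reset_index(): 'store' becomes the index
indexed = df_sales.groupby("store").sum()
revenue_a = indexed.loc["A", "revenue"]  # access by index label

# With reset_index(): 'store' is a regular column again
flat = df_sales.groupby("store").sum().reset_index()

# Alternative: ask groupby not to use the grouping column as the index
flat_direct = df_sales.groupby("store", as_index=False).sum()

print(indexed)
print(flat)
```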

In [None]:
# Use this as a playground to explore with grouping and aggregations

#### 🏆 Grouping and Aggregating Data Challenge

**Does the number of passengers affect the tip amount for the taxi trip?**

You are tasked with analyzing taxi trip data to understand how passenger count affects tipping behavior and trip characteristics.

Using the filtered DataFrame from the previous challenge, calculate the average tip amount, average trip distance, and the total number of trips for different passenger counts.

- The average tip amount for each group of passenger counts.
- The average trip distance for each group.
- The total number of trips for each group.

By completing this challenge, you will gain insights into the relationship between passenger count and trip metrics, which can be valuable for understanding customer behavior and improving service.

Steps to Accomplish the Challenge
- Import Required Libraries:

  Ensure you have the necessary libraries imported, particularly pandas.

- Load the Data:

  Load the taxi trip data into a DataFrame. If using a filtered DataFrame (e.g., filtered_data), make sure it has already been created based on relevant criteria (such as valid passenger counts).

- Group the Data:

  Use the groupby() method on the passenger_count column to group the data accordingly.

- Aggregate the Statistics:

  Use the agg() function to compute:
  - The average of the tip_amount column (average tip).
  - The average of the trip_distance column (average trip distance).
  - The count of non-null values (number of trips).

  Ensure to name the resulting columns appropriately.

- Reset the Index:

  Call reset_index() to convert the grouped object back into a standard DataFrame format, making it easier to read and manipulate.

- Display the Results:

  Print or display the resulting DataFrame average_stats to view the aggregated statistics.

### Data Visualization with Pandas
You can create quick plots directly from pandas, which integrates well with Matplotlib for more complex visualizations.

Matplotlib

- Line Chart: Used to display data points connected by straight lines, ideal for showing trends over time. `plt.plot(x, y)`
- Bar Chart: Displays data with rectangular bars representing the values, useful for comparing different categories. `plt.bar(x, height)`
- Horizontal Bar Chart: Similar to a bar chart, but bars are displayed horizontally. `plt.barh(y, width)`
- Histogram: Used to represent the distribution of numerical data by dividing the data into bins. `plt.hist(data, bins=10)`
- Scatter Plot: Displays individual data points using Cartesian coordinates, useful for showing relationships between two variables.
`plt.scatter(x, y)`
- Pie Chart: Represents proportions of a whole as slices of a circle, suitable for showing percentage distributions.
`plt.pie(sizes, labels=labels)`
- Box Plot: Displays the distribution of data based on five summary statistics: minimum, first quartile, median, third quartile, and maximum. `plt.boxplot(data)`
- Heatmap: Represents data values in a matrix format using color gradients, useful for visualizing correlations or patterns. `plt.imshow(data, cmap='hot', interpolation='nearest')`
- Area Chart: Similar to line charts, but the area below the line is filled in, highlighting the volume of data. `plt.fill_between(x, y)`
- Violin Plot: Combines box plot and kernel density plot to show data distribution and density, useful for comparing distributions between multiple groups. `plt.violinplot(data)`


[Documentation](https://matplotlib.org/)

[![Matplotlib cheat sheet](https://matplotlib.org/cheatsheets/_images/cheatsheets-1.png)](https://matplotlib.org/cheatsheets/)

[![Matplotlib cheat sheet](https://matplotlib.org/cheatsheets/_images/cheatsheets-2.png)](https://matplotlib.org/cheatsheets/)


#### Simple Line Plot

A simple line chart is best used in the following scenarios:

- **Time Series Data**: When you want to visualize trends over time, such as daily, monthly, or yearly data. Line charts effectively show how values change over intervals.

- **Continuous Data**: Ideal for displaying continuous data where you expect smooth transitions between points, such as temperature changes, stock prices, or sales figures.

- **Comparison of Multiple Series**: When you need to compare multiple related datasets over the same time period. Multiple lines can be plotted on the same chart to show how different groups compare over time.

- **Highlighting Trends**: Use line charts to identify upward or downward trends and cycles in the data, helping to visualize long-term trends effectively.

- **Simple Relationships**: When you want to illustrate the relationship between two continuous variables (e.g., time and temperature) without the need for more complex visualizations.

Overall, a simple line chart is a powerful and straightforward tool for presenting data in a way that highlights changes, trends, and comparisons clearly.


```python
# Plot a line chart for a time series
df['DateColumn'] = pd.to_datetime(df['DateColumn'])
df.set_index('DateColumn')['NumericalColumn'].plot(title='Time Series Plot')
```


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Generate sample data for a month
date_range = pd.date_range(start='2024-01-01', end='2024-01-31')
np.random.seed(0)  # For reproducibility
numerical_values = np.random.randint(0, 100, size=len(date_range))

# Create a DataFrame
df = pd.DataFrame({
    'DateColumn': date_range,
    'NumericalColumn': numerical_values
})

# Ensure 'DateColumn' is datetime (a no-op here, but required when dates are loaded as strings)
df['DateColumn'] = pd.to_datetime(df['DateColumn'])

# Plot a line chart for the time series
df.set_index('DateColumn')['NumericalColumn'].plot(title='Time Series Plot for January 2024')
plt.xlabel('Date')
plt.ylabel('Value')

# Adjust the y-axis to start at 0
plt.ylim(0, df['NumericalColumn'].max()+10)  # Set the y-axis limits

plt.grid()
plt.show()
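
The "Comparison of Multiple Series" scenario can be sketched the same way: when a DataFrame has several numerical columns and a datetime index, `DataFrame.plot()` draws one line per column. The column names `StoreA` and `StoreB` and their values are hypothetical and exist only for this example.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical daily values for two series over January 2024
rng = np.random.default_rng(0)
dates = pd.date_range(start='2024-01-01', end='2024-01-31')
series_df = pd.DataFrame({
    'StoreA': rng.integers(50, 100, size=len(dates)),
    'StoreB': rng.integers(20, 80, size=len(dates)),
}, index=dates)

# One line per column, sharing the same date axis
ax = series_df.plot(title='Two Series Compared Over January 2024')
ax.set_xlabel('Date')
ax.set_ylabel('Value')
ax.grid(True)
plt.show()
```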


#### 🏆 Line Chart Challenge

**Analyzing Daily Taxi Trip Data**

The objective of this challenge is to analyze and visualize daily taxi trip data for January 2024. You will calculate the number of trips taken each day and create a time series plot to display this information clearly.

Steps to Accomplish the Challenge:
- Import Necessary Libraries.
- Load Your Data: Load the taxi trip data into a DataFrame. You can use the data generated in the previous challenge.
- Convert Timestamp to Datetime: Convert the `tpep_pickup_datetime` column to a datetime format for easier manipulation.
- Extract the Date: Create a new column called `date_pickup` that contains only the date part of `tpep_pickup_datetime`.
- Group Data by Date: Group the data by the `date_pickup` column and calculate the number of trips (using the count of `tip_amount`) for each day.
- Create a Time Series Plot: Plot the number of trips against the date using a line chart. Set the `date_pickup` column as the index.
- Customize the Plot:
  - Add labels for the x-axis and y-axis.
  - Set the x-axis limits to display only the data for January 2024.
  - Rotate x-axis labels to 90 degrees for better readability.
  - Enable grid lines for better readability and display the plot.
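
One possible shape for a solution is sketched below. Because the real taxi file is not part of this section, the sketch builds a hypothetical stand-in DataFrame with random pickup times; only the column names `tpep_pickup_datetime` and `tip_amount` come from the challenge text. It plots with Matplotlib directly so that the x-axis limits can be given as plain `datetime.date` values.

```python
import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in data: 2,000 random pickup times in January 2024
rng = np.random.default_rng(0)
start = pd.Timestamp('2024-01-01')
seconds_in_january = 31 * 24 * 60 * 60
offsets = rng.integers(0, seconds_in_january, size=2000)
trips = pd.DataFrame({
    'tpep_pickup_datetime': (start + pd.to_timedelta(offsets, unit='s')).astype(str),
    'tip_amount': rng.uniform(0, 10, size=2000).round(2),
})

# Convert the timestamp strings to datetime
trips['tpep_pickup_datetime'] = pd.to_datetime(trips['tpep_pickup_datetime'])

# Keep only the date part of each pickup time
trips['date_pickup'] = trips['tpep_pickup_datetime'].dt.date

# Count trips per day; the grouped column becomes the index
daily_trips = trips.groupby('date_pickup')['tip_amount'].count()

# Line chart of trips per day
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(daily_trips.index, daily_trips.values)
ax.set_title('Daily Taxi Trips, January 2024')
ax.set_xlabel('Pickup date')
ax.set_ylabel('Number of trips')
ax.set_xlim(datetime.date(2024, 1, 1), datetime.date(2024, 1, 31))
ax.tick_params(axis='x', labelrotation=90)
ax.grid(True)
fig.tight_layout()
plt.show()
```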


#### Bar Plot for Categorical Data

Bar plot charts are particularly useful in the following scenarios:

- **Categorical Data Comparison**: When you need to compare quantities across different categories. Bar plots clearly display the differences in counts or measurements among distinct groups.

- **Discrete Variables**: Ideal for visualizing discrete data where categories are non-numerical, such as survey responses (e.g., preferences, types of products, etc.).

- **Ranking**: Bar charts can effectively show the ranking of categories based on their values. For example, you might use a bar plot to display sales figures for different products, making it easy to see which product performed best.

- **Frequency Distribution**: When you want to show the frequency distribution of a categorical variable, such as the number of occurrences of different species in a dataset.

- **Multiple Series Comparison**: Bar plots can display multiple datasets side-by-side (grouped bar plots) or stacked, allowing for easy comparison of different categories across multiple groups (e.g., sales by region and by product).

- **Data Presentation**: When you need a clear and straightforward way to present data in reports or presentations, bar plots are easily understood and visually impactful.

- **Visualization of Changes Over Time** (for Categorical Data): While line charts are generally preferred for continuous data over time, bar plots can also effectively display changes in categorical data over discrete time intervals (e.g., monthly sales by category).

Overall, bar plots are a versatile tool for visualizing categorical data and comparing quantities, making them a staple in data analysis and presentation.

```python
# Plot a bar chart for the counts of each category
df['CategoryColumn'].value_counts().plot(kind='bar', title='Category Counts')
```


In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Sample data for categorical counts
data = {
    'CategoryColumn': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'D', 'D', 'C']
}
df = pd.DataFrame(data)

# Count occurrences of each category
category_counts = df['CategoryColumn'].value_counts()

# Plot a bar chart for the counts of each category
plt.figure(figsize=(10, 6))  # Set the figure size
category_counts.plot(kind='bar', color='skyblue', edgecolor='black', title='Category Counts')

# Adding labels and title
plt.xlabel('Categories')
plt.ylabel('Count')
plt.xticks(rotation=0)  # Keep x-axis labels horizontal (pandas rotates them 90 degrees by default)

# Adding a grid for easier reading
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Display the value on top of each bar
for index, value in enumerate(category_counts):
    plt.text(index, value, str(value), ha='center', va='bottom')

# Show the plot
plt.tight_layout()  # Adjust layout for better spacing
plt.show()


#### 🏆 Bar Chart Challenge

**Analyzing Taxi Trip Volume by Vendor**

The objective of this challenge is to analyze and visualize the volume of taxi trips per vendor using the provided dataset from the Filtering and Selecting Data Challenge. You will count the number of trips associated with each vendor and create a bar chart to display this information clearly.

Steps to Accomplish the Challenge:
- Import Necessary Libraries
- Load Your Data
- Count Trip Occurrences per Vendor
- Create a Bar Chart
- Add Labels and Titles: Label the x-axis as "Vendor ID" and the y-axis as "Trips". Rotate x-axis labels if necessary for better readability.
- Enhance Readability with Grid Lines
- Display Values on Top of Bars
- Show the Plot

By following these steps, you will generate a bar chart that visually represents the volume of taxi trips for each vendor. This visualization will help identify trends in trip distribution among different vendors and provide insights into their relative performance.
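
A minimal sketch of these steps follows. The dataset from the earlier challenge is not reproduced here, so the sketch generates hypothetical trips, and the column name `VendorID` is an assumption about how the vendor column is labelled in your data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in data: 500 trips spread over three vendor IDs
rng = np.random.default_rng(0)
trips = pd.DataFrame({
    'VendorID': rng.choice([1, 2, 6], size=500, p=[0.3, 0.6, 0.1])  # assumed column name
})

# Count trips per vendor, ordered by vendor ID
vendor_counts = trips['VendorID'].value_counts().sort_index()

# Bar chart; IDs are converted to strings so they are treated as categories
fig, ax = plt.subplots(figsize=(8, 5))
ax.bar(vendor_counts.index.astype(str), vendor_counts.values,
       color='skyblue', edgecolor='black')
ax.set_title('Taxi Trips per Vendor')
ax.set_xlabel('Vendor ID')
ax.set_ylabel('Trips')
ax.grid(axis='y', linestyle='--', alpha=0.7)

# Write each count just above its bar
for position, count in enumerate(vendor_counts.values):
    ax.text(position, count, str(count), ha='center', va='bottom')

fig.tight_layout()
plt.show()
```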

#### Scatter Plot for Numerical Relationships

Scatter plots are useful in the following scenarios:

- **Relationship Between Two Variables**: When you want to explore the relationship or correlation between two numerical variables. Scatter plots can reveal whether a positive, negative, or no correlation exists.

- **Identifying Trends**: They help visualize trends in data, showing how one variable changes in relation to another. For example, you might use a scatter plot to analyze how temperature affects ice cream sales.

- **Outlier Detection**: Scatter plots are effective for identifying outliers or anomalies in the data. Points that fall far away from the general cluster of data can indicate unusual behavior or errors in data collection.

- **Distribution Visualization**: When you need to see the distribution of data points across two dimensions. This can help in understanding the density and spread of data.

- **Multivariable Analysis**: If you want to explore potential interactions between two variables while considering other variables, scatter plots can be enhanced with color or size encoding to represent additional data dimensions.

- **Comparative Analysis**: They can be used to compare different groups or categories within the data. By using different colors or markers for different categories, you can see how they relate to each other.

- **Regression Analysis**: Scatter plots are often the first step in regression analysis, allowing you to visualize the data before fitting a regression model to understand the relationship better.

In summary, scatter plots are a powerful tool for visualizing relationships and distributions in numerical data, making them invaluable for exploratory data analysis and hypothesis testing.

```python
# Plot a scatter plot to see relationships between two numerical columns
df.plot(kind='scatter', x='NumericalColumn1', y='NumericalColumn2', title='Scatter Plot')
```

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Sample data for two numerical columns
np.random.seed(0)  # For reproducibility
data = {
    'NumericalColumn1': np.random.rand(100) * 100,  # Random values between 0 and 100
    'NumericalColumn2': np.random.rand(100) * 100  # Random values between 0 and 100
}

df = pd.DataFrame(data)

# Create a scatter plot to see relationships between two numerical columns
plt.figure(figsize=(10, 6))  # Set the figure size
plt.scatter(df['NumericalColumn1'], df['NumericalColumn2'],
            color='blue', alpha=0.6, edgecolor='black')

# Adding titles and labels
plt.title('Scatter Plot of NumericalColumn1 vs NumericalColumn2')
plt.xlabel('NumericalColumn1')
plt.ylabel('NumericalColumn2')

# Adding a grid for better readability
plt.grid(True, linestyle='--', alpha=0.7)

# Optionally, you can add a regression line (if applicable)
# Here, we calculate a simple linear regression line
m, b = np.polyfit(df['NumericalColumn1'], df['NumericalColumn2'], 1)
plt.plot(df['NumericalColumn1'], m * df['NumericalColumn1'] + b, color='red', linewidth=2)

# Show the plot
plt.tight_layout()  # Adjust layout for better spacing
plt.show()


#### 🏆 Scatterplot Chart Challenge

**Analyzing Trip Data Relationships**

The goal of this challenge is to analyze the relationship between the total amount charged for taxi trips and the distance traveled. This involves cleaning the dataset by removing outliers and correcting data issues, followed by visualizing the results with a scatter plot and a regression line.

High-Level Steps:

- Data Cleaning:

  - Identify and filter out outlier trips with excessive trip distances (greater than 200 kilometers).
  - Convert any negative values in the total_amount column to their absolute values to ensure all amounts are non-negative.

- Data Visualization:

  - Create a scatter plot to visualize the relationship between total_amount and trip_distance.
  - Set appropriate titles and labels for clarity, including a grid for better readability.

- Regression Analysis:

  - Calculate the linear regression line to understand the correlation between the two variables. Use `numpy.polyfit()` to obtain the slope and intercept.
  - Generate a range of x-values for the regression line based on the cleaned total_amount data.
  - Calculate the corresponding y-values using the linear regression equation.

- Final Visualization:

  - Plot the regression line over the scatter plot to illustrate the relationship between the total amount charged and trip distance.
  - Display the final plot with proper layout adjustments for optimal viewing.

This challenge will help you practice data cleaning, visualization techniques, and regression analysis using Python and its libraries.
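
The high-level steps could be sketched roughly as follows. The data is synthetic and hypothetical (a noisy linear fare with one oversized distance and one negative amount injected on purpose); only the column names `trip_distance` and `total_amount` and the 200 threshold come from the challenge description.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in data with two deliberate data issues
rng = np.random.default_rng(0)
distance = rng.uniform(0.5, 30, size=300)
amount = 3.0 + 2.5 * distance + rng.normal(0, 3, size=300)
trips = pd.DataFrame({'trip_distance': distance, 'total_amount': amount})
trips.loc[0, 'trip_distance'] = 5000.0                              # outlier distance
trips.loc[1, 'total_amount'] = -abs(trips.loc[1, 'total_amount'])   # negative amount

# Data cleaning: drop excessive distances, make amounts non-negative
cleaned = trips[trips['trip_distance'] <= 200].copy()
cleaned['total_amount'] = cleaned['total_amount'].abs()

# Regression: slope and intercept of trip_distance as a function of total_amount
slope, intercept = np.polyfit(cleaned['total_amount'], cleaned['trip_distance'], 1)
x_line = np.linspace(cleaned['total_amount'].min(), cleaned['total_amount'].max(), 100)
y_line = slope * x_line + intercept

# Scatter plot with the regression line on top
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(cleaned['total_amount'], cleaned['trip_distance'], alpha=0.6, edgecolor='black')
ax.plot(x_line, y_line, color='red', linewidth=2)
ax.set_title('Trip Distance vs Total Amount')
ax.set_xlabel('total_amount')
ax.set_ylabel('trip_distance')
ax.grid(True, linestyle='--', alpha=0.7)
fig.tight_layout()
plt.show()
```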

### Advanced Data Analytics: Calculating Correlations
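Correlation analysis measures the strength and direction of the linear relationship between pairs of numerical columns, and is often a key part of data analytics. By default, `DataFrame.corr()` computes the Pearson correlation coefficient, which ranges from -1 to 1.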

```python
import matplotlib.pyplot as plt
import seaborn as sns  # Heatmap plotting (requires the seaborn library)

# Calculate correlations between numerical columns
# (numeric_only=True skips non-numeric columns such as text, which raise an error in pandas 2.x)
correlation_matrix = df.corr(numeric_only=True)

# Display the correlation matrix
print(correlation_matrix)

# Plot a heatmap
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm")
plt.title("Correlation Matrix")
plt.show()
```


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Sample data creation
np.random.seed(0)  # For reproducibility

# Creating a DataFrame with realistic numerical data
data = {
    'Sales': np.random.normal(200, 50, 1000),  # Normal distribution of sales
    'Advertising': np.random.normal(50, 10, 1000),  # Advertising expenses
    'Profit': np.random.normal(30, 5, 1000),  # Placeholder profit values (overwritten below)
    'Customer_Rating': np.random.uniform(1, 5, 1000),  # Customer ratings from 1 to 5
    'Product_Cost': np.random.normal(20, 5, 1000)  # Cost of the product
}

# Make profit depend on sales and product cost, plus random noise
data['Profit'] = data['Sales'] * 0.15 - data['Product_Cost'] + np.random.normal(0, 5, 1000)

df = pd.DataFrame(data)

# Calculate correlations between numerical columns
correlation_matrix = df.corr()

# Display the correlation matrix
print("Correlation Matrix:")
print(correlation_matrix)

# Plot a heatmap
plt.figure(figsize=(10, 8))  # Set the figure size
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", fmt=".2f", square=True, linewidths=0.5)
plt.title("Correlation Matrix")
plt.show()
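
To make the numbers in the matrix less of a black box, the sketch below computes one Pearson coefficient by hand, using $r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$, and compares it with the value pandas returns. The arrays `x` and `y` are hypothetical data generated only for this check.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y depends linearly on x, plus noise
rng = np.random.default_rng(0)
x = rng.normal(0, 1, 200)
y = 2 * x + rng.normal(0, 1, 200)
pair = pd.DataFrame({'x': x, 'y': y})

# Pearson correlation from its definition
x_dev = x - x.mean()
y_dev = y - y.mean()
r_manual = np.sum(x_dev * y_dev) / np.sqrt(np.sum(x_dev ** 2) * np.sum(y_dev ** 2))

# The same coefficient as computed by pandas (Pearson is the default method)
r_pandas = pair['x'].corr(pair['y'])

print(f"Manual: {r_manual:.4f}, pandas: {r_pandas:.4f}")
```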


### Applying Functions to Columns in DataFrames
In data analytics, applying functions to columns of a DataFrame is a powerful technique for transforming and manipulating data. This can be useful for a variety of scenarios, such as data cleaning, feature engineering, and deriving new insights from existing data.

Here, we will explore several examples and use cases to illustrate how to apply functions to columns in a pandas DataFrame.


#### Squaring Values in a Numerical Column

Use Case: This transformation could be used in a scenario where you want to create polynomial features for a regression model.




In [None]:
import pandas as pd

# Sample data
data = pd.DataFrame({
    'NumericalColumn': [1, 2, 3, 4, 5]
})

# Apply a custom function to square the values in 'NumericalColumn'
data['SquaredValues'] = data['NumericalColumn'].apply(lambda x: x ** 2)

print(data)
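
For simple arithmetic like this, `apply` is not strictly needed: pandas operators work on a whole column at once, which is usually shorter and faster. A minimal sketch of the same transformation written that way:

```python
import pandas as pd

# Same sample data as above
data = pd.DataFrame({
    'NumericalColumn': [1, 2, 3, 4, 5]
})

# Vectorized alternative: the ** operator is applied to every element of the column
data['SquaredValues'] = data['NumericalColumn'] ** 2

print(data)
```

`apply` remains the right tool when the transformation cannot be expressed with built-in operators or methods.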

#### Converting Units

Use Case: This is useful in scenarios where temperature data needs to be standardized to a specific unit for analysis or visualization.




In [None]:
# Sample data with temperature in Celsius
temperature_data = pd.DataFrame({
    'Celsius': [0, 20, 100, -10]
})

# Function to convert Celsius to Fahrenheit
def celsius_to_fahrenheit(celsius):
    return (celsius * 9/5) + 32

# Apply the conversion function
temperature_data['Fahrenheit'] = temperature_data['Celsius'].apply(celsius_to_fahrenheit)

print(temperature_data)

#### Text Processing

Use Case: This can be particularly useful in data cleaning, where inconsistent capitalization in text data needs to be standardized before further analysis or reporting.


In [None]:
import pandas as pd
# Sample data with names
name_data = pd.DataFrame({
    'Names': ['alice', 'BOB', 'Charlie', 'dave']
})

# Apply a function that capitalizes the first letter of each name and lowercases the rest
name_data['CapitalizedNames'] = name_data['Names'].apply(lambda x: x.capitalize())

print(name_data)
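
The same cleanup can also be written with the `.str` accessor, which exposes string methods for a whole column. The sketch below reuses the sample names, with extra whitespace added as a hypothetical second problem to clean, and chains `str.strip()` before `str.capitalize()`.

```python
import pandas as pd

# Sample names with inconsistent capitalization and stray whitespace (hypothetical)
name_data = pd.DataFrame({
    'Names': [' alice', 'BOB ', 'Charlie', 'dave']
})

# Strip surrounding whitespace, then capitalize, without an explicit lambda
name_data['CapitalizedNames'] = name_data['Names'].str.strip().str.capitalize()

print(name_data)
```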

#### Conditional Transformations

Use Case: This transformation can help categorize numerical values into discrete categories, which is useful for summarizing performance metrics.


In [None]:
import pandas as pd
# Sample data with scores
scores_data = pd.DataFrame({
    'Scores': [85, 92, 78, 90, 60]
})

# Apply a function to categorize scores
scores_data['Grade'] = scores_data['Scores'].apply(lambda x: 'A' if x >= 90 else ('B' if x >= 80 else 'C'))

print(scores_data)
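
When the number of categories grows, nested conditional expressions become hard to read. One alternative, sketched here under the same grade boundaries (below 80 is C, 80 to 89 is B, 90 and above is A), is `pd.cut`, which assigns each value to a labelled bin.

```python
import numpy as np
import pandas as pd

# Same sample scores as above
scores_data = pd.DataFrame({
    'Scores': [85, 92, 78, 90, 60]
})

# right=False makes each bin include its left edge: [-inf, 80), [80, 90), [90, inf)
scores_data['Grade'] = pd.cut(
    scores_data['Scores'],
    bins=[-np.inf, 80, 90, np.inf],
    labels=['C', 'B', 'A'],
    right=False,
)

print(scores_data)
```

Note that the resulting `Grade` column has a categorical dtype rather than plain strings.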

#### Date Manipulation

Use Case: Extracting specific date components is crucial for time series analysis and allows for grouping or filtering by time periods.


In [None]:

# Sample data with dates
date_data = pd.DataFrame({
    'Date': ['2023-01-01', '2023-06-15', '2023-09-20']
})

# Convert to datetime format and extract the month
date_data['Date'] = pd.to_datetime(date_data['Date'])
date_data['Month'] = date_data['Date'].apply(lambda x: x.month)

print(date_data)
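
Once a column has a datetime dtype, the `.dt` accessor offers the same components without `apply`. A short sketch, using the same sample dates; the extra columns `MonthName` and `DayOfWeek` are added here only for illustration.

```python
import pandas as pd

# Same sample dates as above
date_data = pd.DataFrame({
    'Date': ['2023-01-01', '2023-06-15', '2023-09-20']
})
date_data['Date'] = pd.to_datetime(date_data['Date'])

# Extract date components for the whole column at once
date_data['Month'] = date_data['Date'].dt.month
date_data['MonthName'] = date_data['Date'].dt.month_name()
date_data['DayOfWeek'] = date_data['Date'].dt.day_name()

print(date_data)
```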
