# Week 3 – Introduction to Data Analysis with Pandas

This notebook is part of **Week 3: Introduction to Data Analysis – Working with Real Data**.

It shows how to:
- Create or load a simple dataset
- Explore the data (rows, columns, summary)
- Find and handle missing values
- Calculate basic statistics
- Do a **mini sales analysis project**: total sales and best-selling product

You can run this notebook in **Jupyter Notebook**, **JupyterLab**, **VS Code**, or **Google Colab**.

In [None]:
# Install pandas if it is not already installed (uncomment the next line if needed).
# %pip installs into the environment used by the running notebook kernel.
# %pip install pandas

import pandas as pd
print('Pandas version:', pd.__version__)

## 1. Create a Simple Sales Dataset

We will first create a small dataset inside Python. Later, we will save it as a CSV file
and load it again (like reading an Excel/CSV file from disk).

In [None]:
# Create a simple sales dataset
data = {
    'Product': ['Pen', 'Pen', 'Notebook', 'Notebook', 'Pencil', 'Pencil', 'Eraser', 'Pen'],
    'Quantity': [10, 15, 5, 8, 20, None, 12, 7],  # One missing value (None)
    'Price': [5.0, 5.0, 50.0, 50.0, 3.0, 3.0, 2.0, 5.0]
}

df = pd.DataFrame(data)
df
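
With pandas' default dtypes, the `None` in `Quantity` is stored as `NaN`, and the column becomes a floating-point column rather than an integer one. The following is a minimal sketch to confirm this; the names `quantity_dtype` and `missing_count` are introduced here only for illustration.

```python
import pandas as pd

# Same sample data as in the cell above
data = {
    'Product': ['Pen', 'Pen', 'Notebook', 'Notebook', 'Pencil', 'Pencil', 'Eraser', 'Pen'],
    'Quantity': [10, 15, 5, 8, 20, None, 12, 7],
    'Price': [5.0, 5.0, 50.0, 50.0, 3.0, 3.0, 2.0, 5.0],
}
df = pd.DataFrame(data)

# dtype chosen by pandas for the column that contains a missing entry
quantity_dtype = df['Quantity'].dtype

# Number of missing entries in the Quantity column
missing_count = int(df['Quantity'].isna().sum())

print('Quantity dtype:', quantity_dtype)
print('Missing Quantity values:', missing_count)
```

This is why the quantities appear as `10.0`, `15.0`, and so on when the DataFrame is displayed.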

## 2. Save Dataset as CSV and Read It

This simulates working with a real CSV file downloaded from the internet or given by your instructor.

In [None]:
# Save the DataFrame to a CSV file
csv_filename = 'sales_data.csv'
df.to_csv(csv_filename, index=False)
print(f'Dataset saved to {csv_filename}')

# Read the CSV file back into a new DataFrame
sales_df = pd.read_csv(csv_filename)
print('Loaded dataset from CSV:')
sales_df
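
To see what the CSV text looks like, and how the missing value survives the round trip, the same save-and-load step can be sketched in memory. This version assumes an `io.StringIO` buffer in place of the `sales_data.csv` file, so nothing is written to disk; the names `buffer`, `csv_text`, and `reloaded` are illustrative.

```python
import io

import pandas as pd

# Same sample data as in section 1
df = pd.DataFrame({
    'Product': ['Pen', 'Pen', 'Notebook', 'Notebook', 'Pencil', 'Pencil', 'Eraser', 'Pen'],
    'Quantity': [10, 15, 5, 8, 20, None, 12, 7],
    'Price': [5.0, 5.0, 50.0, 50.0, 3.0, 3.0, 2.0, 5.0],
})

# Write the CSV into an in-memory text buffer instead of a file
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Read the same text back; the empty Quantity field is parsed as a missing value
reloaded = pd.read_csv(io.StringIO(csv_text))
print('Missing Quantity values after reload:', reloaded['Quantity'].isna().sum())
```

In the printed text, the missing quantity shows up as an empty field between two commas.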

## 3. Explore the Dataset

We will check:
- First few rows
- Number of rows and columns
- Column names and data types
- Basic statistics

In [None]:
# Display first few rows
print('First 5 rows:')
print(sales_df.head())

# Shape of the dataset (rows, columns)
print('\nShape of dataset (rows, columns):', sales_df.shape)

# Column names
print('\nColumn names:', list(sales_df.columns))

# Info about data types and non-null counts
# (info() prints its report itself and returns None, so it is not wrapped in print)
print('\nDataset info:')
sales_df.info()

# Summary statistics for numeric columns
print('\nSummary statistics:')
print(sales_df.describe())
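
`describe()` only covers the numeric columns by default. As a possible extra exploration step, the text column can be summarized with `value_counts()` and `nunique()`. The sketch below rebuilds the sample data directly rather than reading the CSV, so it runs on its own; `product_counts` and `quantity_count` are illustrative names.

```python
import pandas as pd

# Same sample data as in section 1 (built directly, without the CSV step)
sales_df = pd.DataFrame({
    'Product': ['Pen', 'Pen', 'Notebook', 'Notebook', 'Pencil', 'Pencil', 'Eraser', 'Pen'],
    'Quantity': [10, 15, 5, 8, 20, None, 12, 7],
    'Price': [5.0, 5.0, 50.0, 50.0, 3.0, 3.0, 2.0, 5.0],
})

# How many rows each product has
product_counts = sales_df['Product'].value_counts()
print(product_counts)

# Number of distinct products
print('Distinct products:', sales_df['Product'].nunique())

# The "count" row of describe() ignores missing values
quantity_count = sales_df.describe().loc['count', 'Quantity']
print('Non-missing Quantity values:', quantity_count)
```

A `count` lower than the number of rows is an early hint that a column contains missing values.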

## 4. Check and Handle Missing Values

Our dataset has one missing value in the **Quantity** column. We will:
- Count missing values
- Create a cleaned version of the data by filling missing values with 0
- Optionally, drop rows with missing values

In [None]:
# Count missing values in each column
print('Missing values per column:')
print(sales_df.isnull().sum())

# Option A: Fill missing values with 0
sales_filled = sales_df.fillna(0)
print('\nDataset after filling missing values with 0:')
print(sales_filled)

# Option B: Drop rows with any missing value
sales_dropped = sales_df.dropna()
print('\nDataset after dropping rows with missing values:')
print(sales_dropped)
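
Filling with 0 and dropping rows are not the only choices. As a third option that this notebook does not use, a missing quantity could be replaced with the median of the known quantities. The sketch below assumes that approach; `median_quantity` and `sales_median_filled` are names made up for this example.

```python
import pandas as pd

# Same sample data as in section 1 (built directly, without the CSV step)
sales_df = pd.DataFrame({
    'Product': ['Pen', 'Pen', 'Notebook', 'Notebook', 'Pencil', 'Pencil', 'Eraser', 'Pen'],
    'Quantity': [10, 15, 5, 8, 20, None, 12, 7],
    'Price': [5.0, 5.0, 50.0, 50.0, 3.0, 3.0, 2.0, 5.0],
})

# Median of the known quantities (missing values are skipped automatically)
median_quantity = sales_df['Quantity'].median()
print('Median quantity:', median_quantity)

# Option C (illustrative): fill only the Quantity column with that median
sales_median_filled = sales_df.fillna({'Quantity': median_quantity})
print(sales_median_filled)
```

Passing a dictionary to `fillna` limits the fill to the named column, which avoids changing other columns by accident.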

## 5. Simple Statistics

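For every numeric column, we calculate:
- The average (mean)
- The maximum and minimum values

These statistics use `sales_filled`, so the Quantity that was filled with 0 counts as a real value: it becomes the minimum Quantity and lowers the average.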

In [None]:
# Mean (average) of numeric columns
print('Average (mean) values:')
print(sales_filled.mean(numeric_only=True))

# Maximum values
print('\nMaximum values:')
print(sales_filled.max(numeric_only=True))

# Minimum values
print('\nMinimum values:')
print(sales_filled.min(numeric_only=True))
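
The way missing values were handled changes these numbers. The following sketch compares the Quantity statistics on the zero-filled data with those on the original data, where pandas skips missing values by default. It also shows `agg`, which returns the three statistics as one table; `stats_filled` and `stats_original` are illustrative names.

```python
import pandas as pd

# Same sample data as in section 1 (built directly, without the CSV step)
sales_df = pd.DataFrame({
    'Product': ['Pen', 'Pen', 'Notebook', 'Notebook', 'Pencil', 'Pencil', 'Eraser', 'Pen'],
    'Quantity': [10, 15, 5, 8, 20, None, 12, 7],
    'Price': [5.0, 5.0, 50.0, 50.0, 3.0, 3.0, 2.0, 5.0],
})
sales_filled = sales_df.fillna(0)

numeric_cols = ['Quantity', 'Price']

# Mean, max and min in one table, on the zero-filled data
stats_filled = sales_filled[numeric_cols].agg(['mean', 'max', 'min'])
print('Zero-filled data:')
print(stats_filled)

# The same statistics on the original data; missing values are skipped
stats_original = sales_df[numeric_cols].agg(['mean', 'max', 'min'])
print('\nOriginal data (missing values skipped):')
print(stats_original)
```

Only the Quantity mean and minimum differ between the two tables; the Price column has no missing values, so its statistics are unchanged.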

## 6. Mini Project – Sales Analysis

We will use the cleaned dataset (`sales_filled`) to:
- Create a new column **Total** = Quantity × Price
- Calculate **total sales** (total revenue)
- Find the **best-selling product** by total quantity
- Create a simple **sales summary report** by product

In [None]:
# Work on a copy of the filled dataset
project_df = sales_filled.copy()

# Create Total = Quantity * Price
project_df['Total'] = project_df['Quantity'] * project_df['Price']
print('Dataset with Total column:')
print(project_df)

# Total sales (overall revenue)
total_sales = project_df['Total'].sum()
print('\nTotal sales (revenue):', total_sales)

# Best-selling product by quantity
product_quantity = project_df.groupby('Product')['Quantity'].sum()
best_selling_product = product_quantity.idxmax()
print('\nBest-selling product (by quantity):', best_selling_product)
print('Total quantity sold per product:')
print(product_quantity)

# Sales summary by product (total revenue per product)
sales_summary = project_df.groupby('Product')['Total'].sum()
print('\nSales summary (revenue per product):')
print(sales_summary)
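
The two `groupby` results above can also be combined into a single report. The sketch below assumes pandas' named aggregation for this; the column names `total_quantity` and `total_revenue` and the variables `top_by_quantity` and `top_by_revenue` are chosen here for illustration.

```python
import pandas as pd

# Same sample data as in section 1, with missing Quantity filled with 0
project_df = pd.DataFrame({
    'Product': ['Pen', 'Pen', 'Notebook', 'Notebook', 'Pencil', 'Pencil', 'Eraser', 'Pen'],
    'Quantity': [10, 15, 5, 8, 20, None, 12, 7],
    'Price': [5.0, 5.0, 50.0, 50.0, 3.0, 3.0, 2.0, 5.0],
}).fillna(0)
project_df['Total'] = project_df['Quantity'] * project_df['Price']

# One row per product, with quantity and revenue side by side,
# sorted so the highest-revenue product comes first
summary = (
    project_df.groupby('Product')
    .agg(total_quantity=('Quantity', 'sum'), total_revenue=('Total', 'sum'))
    .sort_values('total_revenue', ascending=False)
)
print(summary)

# The "best" product depends on the measure used
top_by_quantity = summary['total_quantity'].idxmax()
top_by_revenue = summary['total_revenue'].idxmax()
print('\nTop product by quantity:', top_by_quantity)
print('Top product by revenue:', top_by_revenue)
```

Putting both measures in one table makes it easy to see when the product that sells the most units is not the one that brings in the most revenue.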

## 7. Conclusion

In this notebook, you practiced basic data analysis with **pandas**:
- Creating and saving a dataset as CSV
- Loading CSV data into a DataFrame
- Exploring rows, columns, and statistics
- Finding and handling missing values
- Doing a simple sales analysis project

You can now replace the sample data with your **own dataset** (for example, your
own marks, expenses, or another sales file) and repeat the same steps.
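
One way to repeat the steps on another dataset is to wrap them in a small function that takes the column names as arguments. The sketch below is only an illustration: `summarize_sales` is a hypothetical helper, and the `Item`, `Units`, and `UnitPrice` columns with their values are made-up sample data, not a real file.

```python
import pandas as pd


def summarize_sales(df, product_col='Product', quantity_col='Quantity', price_col='Price'):
    """Hypothetical helper: repeat the notebook's analysis on any sales-style table."""
    work = df.copy()
    # Same choice as in this notebook: treat a missing quantity as 0
    work[quantity_col] = work[quantity_col].fillna(0)
    work['Total'] = work[quantity_col] * work[price_col]

    # Quantity and revenue per product
    per_product = work.groupby(product_col)[[quantity_col, 'Total']].sum()
    total_revenue = work['Total'].sum()
    best_seller = per_product[quantity_col].idxmax()
    return total_revenue, best_seller, per_product


# Made-up example data with different column names
own_df = pd.DataFrame({
    'Item': ['Tea', 'Coffee', 'Tea'],
    'Units': [3, 1, 2],
    'UnitPrice': [2.0, 4.5, 2.0],
})

total_revenue, best_seller, per_product = summarize_sales(
    own_df, product_col='Item', quantity_col='Units', price_col='UnitPrice'
)
print('Total revenue:', total_revenue)
print('Best seller by quantity:', best_seller)
print(per_product)
```

For a real file, `own_df` would instead come from `pd.read_csv` with the path to that file.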