# Python Data Science Tutorial

Welcome to this beginner-friendly tutorial on Python data science techniques! In this notebook, you'll learn fundamental concepts and practice with hands-on exercises.

## What You'll Learn

- Working with NumPy arrays
- Data manipulation with Pandas
- Basic data visualization with Matplotlib
- Simple statistical analysis

## How to Use This Notebook

1. Read each section carefully
2. Complete the exercises by filling in the missing code (marked with `# YOUR CODE HERE`)
3. Run each cell to verify your answers
4. Expand the answer below each exercise if you need help

Let's get started!

## Setup: Import Libraries

First, let's import the libraries we'll be using throughout this tutorial.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Set display options
pd.set_option('display.max_columns', 10)

print("Libraries imported successfully!")

---

## Section 1: NumPy Basics

NumPy is the fundamental package for scientific computing in Python. It provides support for arrays, matrices, and many mathematical functions.

### Creating NumPy Arrays

Here's how to create a simple NumPy array:

In [None]:
# Creating a 1D array
arr1 = np.array([1, 2, 3, 4, 5])
print("1D Array:", arr1)

# Creating a 2D array
arr2 = np.array([[1, 2, 3], [4, 5, 6]])
print("2D Array:\n", arr2)

### Exercise 1: Create a NumPy Array

Create a NumPy array containing the numbers 10, 20, 30, 40, 50 and store it in a variable called `my_array`. Then print its shape.

In [None]:
# Exercise 1: Create a NumPy array with values 10, 20, 30, 40, 50
# YOUR CODE HERE
my_array = None  # Replace None with your code

# Print the array and its shape
print("Array:", my_array)
print("Shape:", my_array.shape if my_array is not None else "N/A")

### Answer 1

<details>
<summary>Click to reveal the answer</summary>

```python
my_array = np.array([10, 20, 30, 40, 50])

print("Array:", my_array)
print("Shape:", my_array.shape)
```

</details>
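
To see what `.shape` reports for arrays of different dimensions, here is a small sketch that compares a 1D and a 2D array. The variable names `one_d` and `two_d` are made up for this illustration.

```python
import numpy as np

one_d = np.array([10, 20, 30, 40, 50])
two_d = np.array([[1, 2, 3], [4, 5, 6]])

# shape is a tuple with one entry per dimension
print("1D shape:", one_d.shape)  # (5,) -> five elements along a single axis
print("2D shape:", two_d.shape)  # (2, 3) -> two rows, three columns

# ndim gives the number of dimensions directly
print("Dimensions:", one_d.ndim, two_d.ndim)
```

Note the trailing comma in `(5,)`: a 1D shape is still a tuple, just with a single entry.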

### Exercise 2: Array Operations

NumPy allows you to perform operations on entire arrays at once. Calculate the mean and sum of the given array.

In [None]:
# Exercise 2: Calculate mean and sum of the array
data = np.array([15, 25, 35, 45, 55, 65])

# YOUR CODE HERE
array_mean = None  # Calculate the mean using np.mean()
array_sum = None   # Calculate the sum using np.sum()

print(f"Data: {data}")
print(f"Mean: {array_mean}")
print(f"Sum: {array_sum}")

### Answer 2

<details>
<summary>Click to reveal the answer</summary>

```python
data = np.array([15, 25, 35, 45, 55, 65])

array_mean = np.mean(data)
array_sum = np.sum(data)

print(f"Data: {data}")
print(f"Mean: {array_mean}")
print(f"Sum: {array_sum}")
```

</details>
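
Operating on an entire array at once also covers arithmetic and comparisons, not only `np.mean()` and `np.sum()`. The sketch below reuses the exercise data; the names `doubled`, `centered`, and `above_mean` are hypothetical.

```python
import numpy as np

data = np.array([15, 25, 35, 45, 55, 65])

# Arithmetic is applied to every element, with no explicit loop
doubled = data * 2

# Subtract the mean from each element
centered = data - data.mean()

# Comparisons return a boolean array of the same length
above_mean = data > data.mean()

print("Doubled:", doubled)
print("Centered:", centered)
print("Above mean:", above_mean)
```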

---

## Section 2: Pandas DataFrames

Pandas is a powerful library for data manipulation and analysis. The main data structure is the DataFrame, which is like a table with rows and columns.

### Creating a DataFrame

In [None]:
# Creating a DataFrame from a dictionary
students_data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Age': [22, 25, 23, 24, 21],
    'Grade': [85, 90, 78, 92, 88],
    'City': ['New York', 'Boston', 'Chicago', 'Houston', 'Phoenix']
}

df = pd.DataFrame(students_data)
print(df)

### Exercise 3: Create a DataFrame

Create a DataFrame called `products` with the following data:
- Product names: 'Laptop', 'Mouse', 'Keyboard', 'Monitor'
- Prices: 999, 29, 79, 299
- Quantities: 10, 50, 30, 15

In [None]:
# Exercise 3: Create a products DataFrame
# YOUR CODE HERE
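products_data = {
    'Product': None,   # Replace None with the list of product names
    'Price': None,     # Replace None with the list of prices
    'Quantity': None   # Replace None with the list of quantities
}

# Build the DataFrame only once every placeholder is filled in;
# pd.DataFrame raises a ValueError when all values are None
is_complete = all(value is not None for value in products_data.values())
products = pd.DataFrame(products_data) if is_complete else None
print(products)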

### Answer 3

<details>
<summary>Click to reveal the answer</summary>

```python
products_data = {
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'Price': [999, 29, 79, 299],
    'Quantity': [10, 50, 30, 15]
}

products = pd.DataFrame(products_data)
print(products)
```

</details>
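
Once a DataFrame exists, a few attributes help confirm that it has the rows and columns you expect. This is a minimal sketch using the `products` data from the exercise.

```python
import pandas as pd

products = pd.DataFrame({
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'Price': [999, 29, 79, 299],
    'Quantity': [10, 50, 30, 15]
})

# (number of rows, number of columns)
print("Shape:", products.shape)

# Column labels as a plain Python list
print("Columns:", list(products.columns))

# Data type of each column
print(products.dtypes)

# First two rows only
print(products.head(2))
```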

### Exercise 4: DataFrame Selection

Using the DataFrame below, select only the 'Name' and 'Grade' columns and store them in a new DataFrame called `selected_data`.

In [None]:
# Exercise 4: Select specific columns
students = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [22, 25, 23, 24],
    'Grade': [85, 90, 78, 92],
    'City': ['New York', 'Boston', 'Chicago', 'Houston']
})

# YOUR CODE HERE
selected_data = None  # Select 'Name' and 'Grade' columns using double brackets [[]]

print(selected_data)

### Answer 4

<details>
<summary>Click to reveal the answer</summary>

```python
students = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [22, 25, 23, 24],
    'Grade': [85, 90, 78, 92],
    'City': ['New York', 'Boston', 'Chicago', 'Houston']
})

selected_data = students[['Name', 'Grade']]

print(selected_data)
```

</details>
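
The double brackets matter because single brackets with one column name return a Series, while a list of column names returns a DataFrame. A short sketch of the difference, with hypothetical variable names:

```python
import pandas as pd

students = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [22, 25, 23, 24],
    'Grade': [85, 90, 78, 92],
    'City': ['New York', 'Boston', 'Chicago', 'Houston']
})

# Single brackets with one label -> a one-dimensional Series
grade_series = students['Grade']

# Double brackets (a list of labels) -> a DataFrame, even with one column
grade_frame = students[['Grade']]

print(type(grade_series).__name__)
print(type(grade_frame).__name__)
```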

### Exercise 5: Filtering Data

Filter the students DataFrame to show only students with a grade greater than 85.

In [None]:
# Exercise 5: Filter data based on a condition
students = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Age': [22, 25, 23, 24, 21],
    'Grade': [85, 90, 78, 92, 88]
})

# YOUR CODE HERE
high_performers = None  # Filter where Grade > 85

print("Students with grade > 85:")
print(high_performers)

### Answer 5

<details>
<summary>Click to reveal the answer</summary>

```python
students = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Age': [22, 25, 23, 24, 21],
    'Grade': [85, 90, 78, 92, 88]
})

high_performers = students[students['Grade'] > 85]

print("Students with grade > 85:")
print(high_performers)
```

</details>
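
Filters can also combine several conditions. The sketch below goes slightly beyond the exercise: it uses `&` (and) and `|` (or), with each condition wrapped in parentheses, which pandas requires because of operator precedence. The age thresholds are chosen only for illustration.

```python
import pandas as pd

students = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Age': [22, 25, 23, 24, 21],
    'Grade': [85, 90, 78, 92, 88]
})

# Both conditions must hold: grade above 85 AND younger than 25
young_high = students[(students['Grade'] > 85) & (students['Age'] < 25)]

# Either condition may hold: grade below 80 OR older than 24
low_or_older = students[(students['Grade'] < 80) | (students['Age'] > 24)]

print(young_high)
print(low_or_older)
```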

---

## Section 3: Data Visualization with Matplotlib

Visualization is crucial for understanding data. Matplotlib is one of the most widely used plotting libraries in Python.

### Creating a Simple Line Plot

In [None]:
# Example: Line plot
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

plt.figure(figsize=(8, 5))
plt.plot(x, y, marker='o', color='blue')
plt.xlabel('X Values')
plt.ylabel('Y Values')
plt.title('Simple Line Plot')
plt.grid(True)
plt.show()

### Exercise 6: Create a Bar Chart

Create a bar chart showing the number of sales for each product.

In [None]:
# Exercise 6: Create a bar chart
products = ['Apples', 'Bananas', 'Oranges', 'Grapes', 'Mangoes']
sales = [150, 200, 130, 80, 170]

# YOUR CODE HERE
plt.figure(figsize=(10, 6))
# Use plt.bar() to create the bar chart with products on x-axis and sales on y-axis
# Add xlabel('Products'), ylabel('Sales'), and title('Product Sales')

plt.show()

### Answer 6

<details>
<summary>Click to reveal the answer</summary>

```python
products = ['Apples', 'Bananas', 'Oranges', 'Grapes', 'Mangoes']
sales = [150, 200, 130, 80, 170]

plt.figure(figsize=(10, 6))
plt.bar(products, sales, color='green')
plt.xlabel('Products')
plt.ylabel('Sales')
plt.title('Product Sales')
plt.show()
```

</details>
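
Bar charts are often easier to read when the bars are sorted. As one possible variation on the answer above, this sketch pairs the products with their sales in a Pandas Series (the name `sorted_sales` is hypothetical) and sorts it before plotting.

```python
import matplotlib.pyplot as plt
import pandas as pd

products = ['Apples', 'Bananas', 'Oranges', 'Grapes', 'Mangoes']
sales = [150, 200, 130, 80, 170]

# Pair each product with its sales, then sort from highest to lowest
sorted_sales = pd.Series(sales, index=products).sort_values(ascending=False)

plt.figure(figsize=(10, 6))
plt.bar(sorted_sales.index, sorted_sales.values, color='green')
plt.xlabel('Products')
plt.ylabel('Sales')
plt.title('Product Sales (sorted)')
plt.show()
```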

### Exercise 7: Create a Scatter Plot

Create a scatter plot showing the relationship between study hours and exam scores.

In [None]:
# Exercise 7: Create a scatter plot
study_hours = [1, 2, 3, 4, 5, 6, 7, 8]
exam_scores = [45, 55, 60, 70, 75, 82, 88, 95]

# YOUR CODE HERE
plt.figure(figsize=(10, 6))
# Use plt.scatter() to create the scatter plot
# Add xlabel('Study Hours'), ylabel('Exam Scores'), and title('Study Hours vs Exam Scores')

plt.show()

### Answer 7

<details>
<summary>Click to reveal the answer</summary>

```python
study_hours = [1, 2, 3, 4, 5, 6, 7, 8]
exam_scores = [45, 55, 60, 70, 75, 82, 88, 95]

plt.figure(figsize=(10, 6))
plt.scatter(study_hours, exam_scores, color='red', s=100)
plt.xlabel('Study Hours')
plt.ylabel('Exam Scores')
plt.title('Study Hours vs Exam Scores')
plt.show()
```

</details>
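
The scatter plot suggests that scores rise with study hours. One way to put a number on that relationship is the Pearson correlation coefficient, sketched here with `np.corrcoef()`; this goes a step beyond what the exercise asks for.

```python
import numpy as np

study_hours = [1, 2, 3, 4, 5, 6, 7, 8]
exam_scores = [45, 55, 60, 70, 75, 82, 88, 95]

# np.corrcoef returns a 2x2 correlation matrix;
# the off-diagonal entry is the correlation between the two inputs
correlation_matrix = np.corrcoef(study_hours, exam_scores)
r = correlation_matrix[0, 1]

print(f"Correlation between study hours and exam scores: {r:.3f}")
```

A value close to 1 indicates a strong positive linear relationship, which matches the upward trend in the plot.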

---

## Section 4: Basic Statistical Analysis

Pandas provides many built-in methods for statistical analysis.

### Exercise 8: Calculate Basic Statistics

Using the DataFrame below, calculate the mean, median, and standard deviation of the 'Salary' column.

In [None]:
# Exercise 8: Calculate statistics
employees = pd.DataFrame({
    'Employee': ['John', 'Jane', 'Bob', 'Alice', 'Tom', 'Sara'],
    'Department': ['Sales', 'IT', 'IT', 'HR', 'Sales', 'HR'],
    'Salary': [55000, 72000, 68000, 52000, 58000, 54000]
})

# YOUR CODE HERE
salary_mean = None    # Use .mean() on the Salary column
salary_median = None  # Use .median() on the Salary column
salary_std = None     # Use .std() on the Salary column

print(f"Mean Salary: ${salary_mean:,.2f}" if salary_mean is not None else "Mean Salary: N/A")
print(f"Median Salary: ${salary_median:,.2f}" if salary_median is not None else "Median Salary: N/A")
print(f"Salary Std Dev: ${salary_std:,.2f}" if salary_std is not None else "Salary Std Dev: N/A")

### Answer 8

<details>
<summary>Click to reveal the answer</summary>

```python
employees = pd.DataFrame({
    'Employee': ['John', 'Jane', 'Bob', 'Alice', 'Tom', 'Sara'],
    'Department': ['Sales', 'IT', 'IT', 'HR', 'Sales', 'HR'],
    'Salary': [55000, 72000, 68000, 52000, 58000, 54000]
})

salary_mean = employees['Salary'].mean()
salary_median = employees['Salary'].median()
salary_std = employees['Salary'].std()

print(f"Mean Salary: ${salary_mean:,.2f}")
print(f"Median Salary: ${salary_median:,.2f}")
print(f"Salary Std Dev: ${salary_std:,.2f}")
```

</details>
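
One detail to keep in mind when comparing results across libraries: Pandas `.std()` computes the sample standard deviation (`ddof=1`) by default, while `np.std()` computes the population standard deviation (`ddof=0`). The sketch below shows the difference on the salary data; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

salaries = pd.Series([55000, 72000, 68000, 52000, 58000, 54000])

# Pandas default: divide by n - 1 (sample standard deviation)
sample_std = salaries.std()

# NumPy default: divide by n (population standard deviation)
population_std = np.std(salaries.to_numpy())

# Passing ddof=1 makes NumPy match the Pandas default
matched_std = np.std(salaries.to_numpy(), ddof=1)

print(f"Pandas .std():      {sample_std:,.2f}")
print(f"NumPy np.std():     {population_std:,.2f}")
print(f"NumPy with ddof=1:  {matched_std:,.2f}")
```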

### Exercise 9: Group By Operations

Using the employees DataFrame, calculate the average salary for each department.

In [None]:
# Exercise 9: Group by and aggregate
employees = pd.DataFrame({
    'Employee': ['John', 'Jane', 'Bob', 'Alice', 'Tom', 'Sara'],
    'Department': ['Sales', 'IT', 'IT', 'HR', 'Sales', 'HR'],
    'Salary': [55000, 72000, 68000, 52000, 58000, 54000]
})

# YOUR CODE HERE
# Use groupby('Department'), select the 'Salary' column, and then call .mean()
# (calling .mean() on every column fails in recent pandas because 'Employee' holds text)
dept_avg_salary = None

print("Average Salary by Department:")
print(dept_avg_salary)

### Answer 9

<details>
<summary>Click to reveal the answer</summary>

```python
employees = pd.DataFrame({
    'Employee': ['John', 'Jane', 'Bob', 'Alice', 'Tom', 'Sara'],
    'Department': ['Sales', 'IT', 'IT', 'HR', 'Sales', 'HR'],
    'Salary': [55000, 72000, 68000, 52000, 58000, 54000]
})

dept_avg_salary = employees.groupby('Department')['Salary'].mean()

print("Average Salary by Department:")
print(dept_avg_salary)
```

</details>
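
A grouped column can also be summarized with several statistics at once. This sketch uses `.agg()` with a list of function names; the name `dept_summary` is hypothetical.

```python
import pandas as pd

employees = pd.DataFrame({
    'Employee': ['John', 'Jane', 'Bob', 'Alice', 'Tom', 'Sara'],
    'Department': ['Sales', 'IT', 'IT', 'HR', 'Sales', 'HR'],
    'Salary': [55000, 72000, 68000, 52000, 58000, 54000]
})

# One row per department, one column per statistic
dept_summary = employees.groupby('Department')['Salary'].agg(['mean', 'min', 'max', 'count'])

print(dept_summary)
```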

---

## Section 5: Putting It All Together

### Exercise 10: Complete Data Analysis

In this final exercise, you'll perform a complete mini data analysis:
1. Create a DataFrame with the given data
2. Calculate summary statistics
3. Create a visualization

In [None]:
# Exercise 10: Complete data analysis task

# Step 1: Create a DataFrame
# Data about monthly sales for a store
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
revenue = [12000, 15000, 13500, 18000, 22000, 19500]
expenses = [8000, 9500, 8800, 11000, 13000, 12000]

# YOUR CODE HERE
# Create a DataFrame called 'sales_df' with columns: 'Month', 'Revenue', 'Expenses'
sales_df = None

# Calculate profit (Revenue - Expenses) and add it as a new column called 'Profit'
# sales_df['Profit'] = ???

# Print the DataFrame
print("Sales Data:")
print(sales_df)
print()

# Step 2: Calculate total revenue and average profit
total_revenue = None  # Use .sum() on Revenue column
avg_profit = None     # Use .mean() on Profit column

print(f"Total Revenue: ${total_revenue:,}" if total_revenue is not None else "Total Revenue: N/A")
print(f"Average Monthly Profit: ${avg_profit:,.2f}" if avg_profit is not None else "Average Monthly Profit: N/A")

In [None]:
# Step 3: Create a bar chart comparing Revenue and Expenses by month
# YOUR CODE HERE

# Hint: Create a grouped bar chart
# Call plt.bar() twice, shifting the x positions by half a bar width in
# opposite directions so Revenue and Expenses sit side by side for each month
# Use plt.xticks() to label the positions with the month names

# Create the visualization here
plt.figure(figsize=(10, 6))

# Your plotting code here...

plt.show()

### Answer 10

<details>
<summary>Click to reveal the answer</summary>

```python
# Step 1: Create a DataFrame
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
revenue = [12000, 15000, 13500, 18000, 22000, 19500]
expenses = [8000, 9500, 8800, 11000, 13000, 12000]

sales_df = pd.DataFrame({
    'Month': months,
    'Revenue': revenue,
    'Expenses': expenses
})

sales_df['Profit'] = sales_df['Revenue'] - sales_df['Expenses']

print("Sales Data:")
print(sales_df)
print()

# Step 2: Calculate statistics
total_revenue = sales_df['Revenue'].sum()
avg_profit = sales_df['Profit'].mean()

print(f"Total Revenue: ${total_revenue:,}")
print(f"Average Monthly Profit: ${avg_profit:,.2f}")

# Step 3: Create visualization
plt.figure(figsize=(10, 6))

x = np.arange(len(months))
width = 0.35

plt.bar(x - width/2, revenue, width, label='Revenue', color='green')
plt.bar(x + width/2, expenses, width, label='Expenses', color='red')

plt.xlabel('Month')
plt.ylabel('Amount ($)')
plt.title('Monthly Revenue vs Expenses')
plt.xticks(x, months)
plt.legend()
plt.show()
```

</details>
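
As an optional follow-up to the analysis, the sketch below finds the most profitable month with `idxmax()` and adds a profit margin column. The `Margin` column and the name `best_row` are not part of the exercise; they are introduced here only for illustration.

```python
import pandas as pd

sales_df = pd.DataFrame({
    'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'],
    'Revenue': [12000, 15000, 13500, 18000, 22000, 19500],
    'Expenses': [8000, 9500, 8800, 11000, 13000, 12000]
})
sales_df['Profit'] = sales_df['Revenue'] - sales_df['Expenses']

# Hypothetical extra column: profit as a fraction of revenue
sales_df['Margin'] = sales_df['Profit'] / sales_df['Revenue']

# idxmax() returns the index label of the largest value;
# .loc then retrieves that whole row
best_row = sales_df.loc[sales_df['Profit'].idxmax()]

print(f"Most profitable month: {best_row['Month']} (${best_row['Profit']:,})")
print(sales_df[['Month', 'Profit', 'Margin']])
```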

---

## Congratulations! 🎉

You've completed this Python Data Science Tutorial! You've learned:

- ✅ How to create and manipulate NumPy arrays
- ✅ How to work with Pandas DataFrames
- ✅ How to select and filter data
- ✅ How to create visualizations with Matplotlib
- ✅ How to perform basic statistical analysis
- ✅ How to use groupby for aggregations

## Next Steps

To continue your data science journey, consider:
- Exploring more advanced Pandas operations (merging, pivoting)
- Learning about Seaborn for statistical visualizations
- Diving into machine learning with scikit-learn
- Working with real-world datasets

Happy coding!