# Day 7: Bringing It All Together - Mini Project

Congratulations on making it to Day 7! You've spent the week learning the fundamentals of NumPy, Pandas, and Matplotlib. Now it's time to put all those skills to the test with a mini-project.

Today, you will be an analyst exploring the famous Titanic dataset. Your goal is to load the data, perform some basic cleaning, and create visualizations to answer a few key questions about the passengers and their survival.

**Your tasks today:**
1.  **Load a new dataset** from a URL.
2.  **Inspect the data** using your Pandas skills (`.head()`, `.info()`, `.describe()`).
3.  **Perform a simple data cleaning** task (handling missing values).
4.  **Create visualizations** (bar charts, histograms) to answer specific questions.

Let's begin by importing our libraries and loading the dataset. The Titanic dataset is available online, so we can load it directly into pandas.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# URL for the Titanic dataset
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"

# Load the dataset into a pandas DataFrame
titanic_df = pd.read_csv(url)
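
A quick sanity check right after loading can catch a failed or partial download. The sketch below uses a tiny in-memory CSV (made-up rows, not the real data) in place of the URL, since `pd.read_csv` also accepts file-like objects. The names `sample_csv` and `sample_df` are hypothetical; the same `.shape` and `.columns` checks can be run on `titanic_df`.

```python
import io

import pandas as pd

# Hypothetical three-row CSV standing in for the downloaded file.
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Age\n"
    "1,0,3,22\n"
    "2,1,1,38\n"
    "3,1,3,\n"
)

sample_df = pd.read_csv(sample_csv)

# shape is (rows, columns); columns holds the header names.
print(sample_df.shape)
print(list(sample_df.columns))
```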

---

## Step 1: Inspect the Data

Before we can analyze the data, we need to understand it. What columns are there? What are their data types? Are there missing values?

**Exercise 1.1:** Use the `.head()` method to display the first 5 rows of the `titanic_df` DataFrame.

In [None]:
# Your code here

**Solution 1.1:**

In [None]:
# Solution
titanic_df.head()

**Exercise 1.2:** Use the `.info()` method to get a summary of the DataFrame, including the data type of each column and its number of non-null values.

In [None]:
# Your code here

**Solution 1.2:**

In [None]:
# Solution
titanic_df.info()
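
The task list also mentions `.describe()`, which summarizes numeric columns (count, mean, standard deviation, min, quartiles, max). Here is a minimal sketch on a small made-up frame; the values are illustrative and not taken from the dataset, and `toy_df` is a hypothetical name. `.isna().sum()` is included as a compact way to count missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing age.
toy_df = pd.DataFrame(
    {
        "Age": [22.0, 38.0, np.nan, 35.0],
        "Fare": [7.25, 71.28, 7.92, 53.10],
    }
)

# Summary statistics for numeric columns; NaN values are skipped.
summary = toy_df.describe()
print(summary)

# Number of missing values in each column.
missing_counts = toy_df.isna().sum()
print(missing_counts)
```

Running the same two calls on `titanic_df` should show which columns have gaps before you start cleaning.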

---

## Step 2: Data Cleaning

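From the output of `.info()`, you probably noticed that the 'Age' column has a lot of missing values: the DataFrame has 891 entries, but only 714 of them have a non-null 'Age', so 177 ages are missing. ('Cabin' and 'Embarked' have gaps too, but we won't use those columns today.) For our analysis, let's fill the missing ages with the median age of all passengers. The median is often a better choice than the mean when the data might have outliers, because a few extreme values can pull the mean noticeably while barely moving the median.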
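
To see why the median is less sensitive to outliers, here is a small sketch with made-up ages (not taken from the dataset): four passengers in their twenties and one 80-year-old. The names `ages`, `toy_mean`, and `toy_median` are hypothetical.

```python
import pandas as pd

# Hypothetical ages: four values in the twenties and one outlier.
ages = pd.Series([22, 24, 26, 28, 80])

toy_mean = ages.mean()      # pulled upward by the single large value
toy_median = ages.median()  # the middle value after sorting

print(f"mean = {toy_mean}, median = {toy_median}")
# prints: mean = 36.0, median = 26.0
```

The mean (36.0) is higher than four of the five ages, while the median (26.0) still describes a typical passenger in this toy sample.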

**Exercise 2.1:** 
1. Calculate the median of the 'Age' column.
2. Use the `.fillna()` method to replace the missing values (`NaN`) in the 'Age' column with the median age. Make sure to update the DataFrame.
3. Verify your work by running `.info()` again to see that the 'Age' column now has 891 non-null values.

In [None]:
# Your code here

**Solution 2.1:**

In [None]:
# Solution
# 1. Calculate the median age
median_age = titanic_df["Age"].median()
print(f"The median age is: {median_age}")

# 2. Fill missing values
titanic_df.fillna({"Age": median_age}, inplace=True)

# 3. Verify the result
print("\nDataFrame info after filling missing ages:")
titanic_df.info()
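
An equivalent way to write step 2 is to assign the filled column back to the DataFrame, which some people find easier to read than `inplace=True`. A minimal sketch on made-up data (the frame `toy_df` and the variable `toy_median` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing ages.
toy_df = pd.DataFrame({"Age": [20.0, np.nan, 30.0, np.nan, 40.0]})

# Median of the non-missing values (20, 30, 40) is 30.
toy_median = toy_df["Age"].median()

# Assign the filled column back to the DataFrame.
toy_df["Age"] = toy_df["Age"].fillna(toy_median)

# Count the missing values that remain in the column.
print(toy_df["Age"].isna().sum())
```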

---

## Step 3: Answering Questions with Visualizations

Now that the missing ages are filled in, we can start exploring the data visually!

**Question 1: How many people survived versus how many did not?**

**Exercise 3.1:** Create a bar chart showing the count of passengers in the 'Survived' column. (0 = No, 1 = Yes).

*Hint: Use the `.value_counts()` method on the 'Survived' column to get the data for your bar chart.*

In [None]:
# Your code here

**Solution 3.1:**

In [None]:
# Solution
# sort_index() orders the counts as 0, 1 so they line up with the tick labels below
survival_counts = titanic_df["Survived"].value_counts().sort_index()

plt.bar(
    x=survival_counts.index,
    height=survival_counts.values,
    tick_label=["Did Not Survive", "Survived"],
)

plt.title("Survival Count on the Titanic")
plt.xlabel("Outcome")
plt.ylabel("Number of Passengers")

plt.show()
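
One thing worth knowing about `.value_counts()`: it orders the result by frequency (largest first), not by label. On the Titanic data, 0 happens to come first because more passengers died than survived, but on other data the order can flip, and tick labels passed by position would then land on the wrong bars. A small sketch with made-up outcomes (the names `outcomes`, `by_count`, and `by_label` are hypothetical) shows the difference and how `.sort_index()` makes the order predictable:

```python
import pandas as pd

# Hypothetical outcomes where 1 is more frequent than 0.
outcomes = pd.Series([1, 1, 1, 0, 0])

by_count = outcomes.value_counts()               # ordered by frequency
by_label = outcomes.value_counts().sort_index()  # ordered by label

print(list(by_count.index))  # [1, 0]
print(list(by_label.index))  # [0, 1]
```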

**Question 2: What was the age distribution of passengers on the Titanic?**

**Exercise 3.2:** Create a histogram of the 'Age' column to see the distribution of passenger ages. Customize it with 25 bins and an edge color of 'black' for clarity.

In [None]:
# Your code here

**Solution 3.2:**

In [None]:
# Solution
plt.hist(titanic_df["Age"], bins=25, edgecolor="black")

plt.title("Age Distribution of Titanic Passengers")
plt.xlabel("Age")
plt.ylabel("Frequency")

plt.show()
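
If you want the numbers behind the bars, `np.histogram` returns the bin counts and bin edges without drawing anything. A small sketch with made-up ages and 4 bins (both are illustrative choices, not values from the exercise):

```python
import numpy as np

# Hypothetical ages for illustration.
ages = np.array([1, 5, 22, 24, 28, 28, 28, 35, 60, 80])

# counts[i] is the number of values between edges[i] and edges[i + 1].
counts, edges = np.histogram(ages, bins=4)

print(counts)
print(edges)
```

There is always one more edge than there are bins, and the counts add up to the number of values.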

**Question 3: Did passenger class influence survival rate?**

**Exercise 3.3:** Create a bar chart that shows the survival rate by passenger class ('Pclass').

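*Hint: This is a bit more challenging! You'll need to use `.groupby('Pclass')['Survived'].mean()` to calculate the survival rate for each class. Because 'Survived' is coded as 0 or 1, the mean of each group is the fraction of passengers in that class who survived.*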
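
To see why the mean of a 0/1 column works as a rate, here is a minimal sketch on made-up data (the frame `toy_df` and its values are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data: two classes with four passengers each.
toy_df = pd.DataFrame(
    {
        "Pclass": [1, 1, 1, 1, 3, 3, 3, 3],
        "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
    }
)

# Mean of a 0/1 column = (number of 1s) / (number of rows) in each group.
rate_by_class = toy_df.groupby("Pclass")["Survived"].mean()
print(rate_by_class)
```

In this toy frame, 3 of 4 first-class passengers survived (0.75) and 1 of 4 third-class passengers survived (0.25).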

In [None]:
# Your code here

**Solution 3.3:**

In [None]:
# Solution
survival_by_class = titanic_df.groupby("Pclass")["Survived"].mean()

plt.bar(
    x=survival_by_class.index,
    height=survival_by_class.values,
    color=["tomato", "skyblue", "lightgreen"],
)

plt.title("Survival Rate by Passenger Class")
plt.xlabel("Passenger Class")
plt.ylabel("Survival Rate")
plt.xticks([1, 2, 3])  # Ensure ticks are on the class numbers

plt.show()

---

### Congratulations on completing Week 1!

You have successfully loaded, inspected, cleaned, and analyzed a real dataset. You used Pandas to manipulate the data and Matplotlib to uncover insights visually. These are the core skills of any data analyst or data scientist.

Next week, we will dive deeper with more advanced topics in Pandas and an introduction to the powerful SciPy library for scientific computing!