AIM #1: Loading the dataset and printing basic information 
1. Import the Titanic dataset using pandas
2. Create a Dataframe from the dataset
3. Print the first 10 rows of the dataset
4. Print the last 20 rows of the dataset
5. Print dataset's information
6. Describe the dataset
7. Make sure the output of each function is displayed as a single table rather than wrapped across multiple lines

In [None]:
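import pandas as pd

# Task 7: show every column and keep each table from wrapping across lines
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)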

# Load the dataset
df = pd.read_csv('titanic.csv')

# Print the first 10 rows
print("First 10 rows:")
print(df.head(10))

# Print the last 20 rows
print("\nLast 20 rows:")
print(df.tail(20))

# Print dataset information (info() prints directly and returns None,
# so wrapping it in print() would add a stray "None" line)
print("\nDataset information:")
df.info()

# Describe the dataset
print("\nDataset description:")
print(df.describe())

AIM #2: Finding issues (empty, NAs, incorrect value, incorrect format, outliers, etc.) 
1. Find out how many missing values there are in the dataset
2. For the 'Age' column, find the best way to handle the missing values
    2.1. Use an appropriate plot to study the nature of the 'Age' column
    2.2. Decide which measure of central tendency best suits the 'Age' column based on the above plot
    2.3. Using the most suitable central tendency measure, fill the missing values in the 'Age' column
3. Decide on the best way to handle the missing values in the 'Cabin' column
4. Similarly, decide on the best way to handle the missing values in the 'Embarked' column
5. Handle the incorrect data in the 'Survived' column using an appropriate measure
6. Handle the incorrectly formatted data in the 'Fare' column


In [None]:
# Calculate the number of missing values
print("\nMissing values count:")
print(df.isnull().sum())

# Handle 'Age' column
import seaborn as sns
import matplotlib.pyplot as plt

# 2.1 Plot the distribution of the 'Age' column
plt.figure(figsize=(10,6))
sns.histplot(df['Age'].dropna(), bins=30, kde=True)
plt.title('Age Distribution')
plt.show()

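# 2.2 Calculate the median, which is less sensitive to skew and outliers than the mean
median_age = df['Age'].median()
print(f"\nMedian Age: {median_age}")

# 2.3 Fill missing 'Age' values
# Assign the result back: an inplace fillna on a selected column is unreliable in recent pandas
df['Age'] = df['Age'].fillna(median_age)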

# Handle 'Cabin' column missing values
# Filling with the mode would assign one real cabin to every passenger without a
# record, so mark those entries as 'Unspecified' instead
df['Cabin'] = df['Cabin'].fillna('Unspecified')

# Handle 'Embarked' column
# Fill with the most frequent value
most_common_embarked = df['Embarked'].mode()[0]
df['Embarked'] = df['Embarked'].fillna(most_common_embarked)

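# Handle incorrect data under 'Survived' column
# astype(int) alone does not enforce 0/1, so coerce to numeric and keep only valid rows
# (assumption: rows with any other value are dropped rather than corrected)
df['Survived'] = pd.to_numeric(df['Survived'], errors='coerce')
df = df[df['Survived'].isin([0, 1])].copy()
df['Survived'] = df['Survived'].astype(int)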

# Handle incorrectly formatted data under 'Fare' column
# Strip currency symbols and thousands separators, then convert to float
# (raw string avoids the invalid "\$" escape warning)
df['Fare'] = df['Fare'].replace(r'[\$,]', '', regex=True).astype(float)
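
Step 2.2 asks for a choice between mean and median based on the shape of the distribution. One way to back up the visual reading of the histogram is to check the sample skewness. The sketch below uses a small made-up series rather than the Titanic data, and the 0.5 cut-off is an assumed rule of thumb, not a fixed standard.

```python
import pandas as pd

# Made-up, right-skewed ages with two gaps; not taken from the Titanic data.
ages = pd.Series([2, 18, 21, 22, 24, 25, 27, 30, None, 35, None, 80], dtype="float64")

# Sample skewness; missing values are ignored by default.
skewness = ages.skew()

# Assumed rule of thumb: treat |skew| > 0.5 as skewed enough to prefer the median.
use_median = abs(skewness) > 0.5
fill_value = ages.median() if use_median else ages.mean()

# Assign the filled series to a new name so the original stays available.
filled = ages.fillna(fill_value)
print(f"skew={skewness:.2f}, use_median={use_median}, fill value={fill_value}")
```

A single large value (80 here) pulls the mean upward while leaving the median almost unchanged, which is why the median is usually the safer fill value for a skewed column.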

AIM #3: Grouping 
1. Find out the average fare grouped by Pclass
    1.1. Plot the above using a suitable plot
2. Find out the average fare grouped by Sex
    2.1. Plot the above using a suitable plot

In [None]:
# Calculate the average fare grouped by 'Pclass'
average_fare_pclass = df.groupby('Pclass')['Fare'].mean()
print("\nAverage fare by Pclass:")
print(average_fare_pclass)
sns.barplot(x='Pclass', y='Fare', data=df)
plt.show()

# Calculate the average fare grouped by 'Sex'
average_fare_sex = df.groupby('Sex')['Fare'].mean()
print("\nAverage fare by Sex:")
print(average_fare_sex)
sns.barplot(x='Sex', y='Fare', data=df)
plt.show()
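
The group means above can hide a few very expensive tickets. A minimal sketch, using made-up fares rather than the Titanic data, of how `agg` can report the median and group size next to the mean:

```python
import pandas as pd

# Made-up fares; column names mirror the Titanic data, values do not.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3],
    "Fare":   [50.0, 100.0, 450.0, 20.0, 30.0, 8.0, 10.0],
})

# One row per class, one column per statistic.
summary = toy.groupby("Pclass")["Fare"].agg(["mean", "median", "count"])
print(summary)
```

When the mean is far above the median within a group, a small number of high fares is driving the average, and a box plot may describe the group better than a bar of means.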

AIM #4: Dataset visualization using pandas

1. Plot the distribution of 'Age' using a suitable plot
2. Plot the distribution of 'Fare' using a suitable plot
3. Plot the distribution of 'Pclass' using a suitable plot
4. Plot the distribution of 'Survived' using a suitable plot
5. Plot the distribution of 'Embarked' using a suitable plot
6. Plot the distribution of 'Fare' grouped by 'Survived'
7. Plot the distribution of 'Fare' grouped by 'Pclass'
8. Plot the distribution of 'Age' grouped by 'Survived'
9. Plot the distribution of 'Age' grouped by 'Pclass'
10. Combine the 'SibSp' and 'Parch' and plot its distribution grouped by 'Survived'
11. Combine the 'SibSp' and 'Parch' and plot its distribution grouped by 'Pclass'
12. Plot a distribution between 'Age' and 'Fare' to see if there's any relationship
13. Are there any other possibilities to show relationships?

In [None]:
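# Plot individual distributions
# Continuous variables: histogram with a density estimate
for var in ['Age', 'Fare']:
    plt.figure(figsize=(10,6))
    sns.histplot(df[var], kde=True)
    plt.title(f'Distribution of {var}')
    plt.show()

# Discrete variables: count plot (a density curve is not meaningful here)
for var in ['Pclass', 'Survived', 'Embarked']:
    plt.figure(figsize=(10,6))
    sns.countplot(x=var, data=df)
    plt.title(f'Distribution of {var}')
    plt.show()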

# Grouped plots
# A box plot shows the spread within each group; a bar plot would show only the mean
grouped_vars = [('Survived', 'Fare'), ('Pclass', 'Fare'), ('Survived', 'Age'), ('Pclass', 'Age')]
for gvar, yvar in grouped_vars:
    plt.figure(figsize=(10,6))
    sns.boxplot(x=gvar, y=yvar, data=df)
    plt.title(f'Distribution of {yvar} grouped by {gvar}')
    plt.show()

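# Combine 'SibSp' and 'Parch' (the +1 counts the passenger themselves)
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

# Count passengers per family size, split by the grouping variable
plt.figure(figsize=(10,6))
sns.countplot(x='FamilySize', hue='Survived', data=df)
plt.title('Family Size Distribution Grouped by Survival')
plt.show()

plt.figure(figsize=(10,6))
sns.countplot(x='FamilySize', hue='Pclass', data=df)
plt.title('Family Size Distribution Grouped by Pclass')
plt.show()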

# Plot a distribution between 'Age' and 'Fare' to see if there's any relationship
plt.figure(figsize=(10,6))
sns.scatterplot(x='Age', y='Fare', data=df)
plt.title('Age vs Fare')
plt.show()
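
For task 13, a scatter plot only covers pairs of numeric columns. One other possibility, sketched here on made-up rows rather than the Titanic data, is a cross-tabulation that relates two categorical columns to an outcome. seaborn's `pairplot` and `heatmap` are further options for a visual version.

```python
import pandas as pd

# Made-up passengers; column names mirror the Titanic data, values do not.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Sex":      ["female", "male", "female", "male", "female", "male", "male", "male"],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the share of survivors in each Pclass x Sex cell.
survival_rate = pd.crosstab(
    toy["Pclass"], toy["Sex"], values=toy["Survived"], aggfunc="mean"
)
print(survival_rate)
```

The resulting table has one row per class and one column per sex, so differences in survival share can be read across both dimensions at once.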

AIM #5: Correlation

1. Generate a correlation matrix for the entire dataset
2. Find correlation between 'Age' and 'Fare'
3. What other possible correlations can be found in the dataset?

In [None]:
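# Generate a correlation matrix
# numeric_only=True skips text columns, which raise an error in pandas 2.x otherwise
corr_matrix = df.corr(numeric_only=True)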
print("\nCorrelation matrix:")
print(corr_matrix)

# Find the correlation between 'Age' and 'Fare'
age_fare_corr = corr_matrix.loc['Age', 'Fare']
print(f"\nCorrelation between Age and Fare: {age_fare_corr}")
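
For task 3, categorical columns can join the correlation matrix once they are encoded as numbers. The sketch below is one possible approach on made-up rows, not the Titanic data: the `IsFemale` column is a hypothetical helper introduced for illustration, and Spearman rank correlation is swapped in for the default Pearson because it is less sensitive to extreme fares.

```python
import pandas as pd

# Made-up passengers; column names mirror the Titanic data, values do not.
toy = pd.DataFrame({
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Fare":     [80.0, 60.0, 26.0, 13.0, 8.0, 7.9, 7.8, 7.2],
    "Sex":      ["female", "male", "female", "male", "female", "male", "male", "male"],
})

# Hypothetical helper column: encode Sex as 0/1 so it can be correlated.
toy["IsFemale"] = (toy["Sex"] == "female").astype(int)

# Rank-based correlation over the numeric columns only (the text column is skipped).
spearman = toy.corr(method="spearman", numeric_only=True)

# Correlation of every other numeric column with survival, sorted for readability.
print(spearman["Survived"].drop("Survived").sort_values())
```

Pairs worth checking in the real data along these lines include 'Pclass' with 'Fare', 'Survived' with 'Pclass', and 'Survived' with an encoded 'Sex' column.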