<span style="text-align: center;color:Red;font-size: 20px;text-align: center;font-weight:bold ; font-size:40px">Task 1: Perform Data Cleaning and Remove Outliers</span>

### Step 1 - Data Cleaning

In [16]:
import pandas as pd

tt=pd.read_csv(r"D:\Future Intern\titanic\train.csv")

tt.head()

tt.shape

tt.isnull().sum()[tt.isnull().sum()>0]

tt.Embarked.mode()

# Filling missing values in 'Age' with the median age
tt['Age'] = tt['Age'].fillna(tt['Age'].median())

# Filling missing values in 'Fare' with the mean fare
tt['Fare'] = tt['Fare'].fillna(tt['Fare'].mean())

# Filling missing values in 'Embarked' with the mode of the Embarked column
tt['Embarked'] = tt['Embarked'].fillna(tt['Embarked'].mode()[0])

# The 'Cabin' column has too many missing values, so we drop it.
# 'PassengerId' is dropped in the same call so that neither drop is overwritten.
tt_cleaned = tt.drop(columns=['Cabin', 'PassengerId'])

tt_cleaned

tt_cleaned.isnull().sum()[ tt_cleaned.isnull().sum()>0]

print("Task 1 - Step 1: Data Cleaning Completed")
print("Missing values filled for Age, Fare, and Embarked")
print("Cabin column was dropped as it has many null values")
print("Final shape of the dataset:", tt_cleaned.shape)

Task 1 - Step 1: Data Cleaning Completed
Missing values filled for Age, Fare, and Embarked
Cabin column was dropped as it has many null values
Final shape of the dataset: (891, 10)
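
The cleaning steps can be tried on a small frame first. The sketch below uses made-up values (the `toy` frame is not part of the Titanic data) and only assumes the same column names; it fills a numeric column with its median, a categorical column with its mode, and drops two columns in one call.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the Titanic columns (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Embarked": ["S", "C", None, "S"],
    "Cabin": [None, "C85", None, None],
    "PassengerId": [1, 2, 3, 4],
})

# Numeric column: fill gaps with the median.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Categorical column: fill gaps with the most frequent value.
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# Drop both unwanted columns in a single call.
toy_cleaned = toy.drop(columns=["Cabin", "PassengerId"])

remaining_nulls = int(toy_cleaned.isnull().sum().sum())
print(toy_cleaned.shape, remaining_nulls)
```

Note that `drop` returns a new frame each time, so two separate assignments that both start from the same source frame keep only the second drop.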


### Step 2  - Removing Outliers

In [17]:
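def remove_outliers(df):
    """Drop rows outside the 1.5 * IQR fences, one numeric column at a time.

    Each column's fences are computed on the rows that survived the
    previous columns, so the result depends on column order.
    """
    total_outliers_removed = 0
    numeric_columns = df.select_dtypes(include=['float64', 'int64']).columns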
    
    for column in numeric_columns:
       
        Q1 = df[column].quantile(0.25)
        Q3 = df[column].quantile(0.75)
        IQR = Q3 - Q1
        
        
        lower = Q1 - 1.5 * IQR
        upper = Q3 + 1.5 * IQR
        
        # Row count before filtering on this column
        original_count = len(df)
        
        # Remove outliers
        df = df[(df[column] >= lower) & (df[column] <= upper)]
        
        # Calculate how many rows were removed
        outliers_removed = original_count - len(df)
        total_outliers_removed += outliers_removed
        
        # Print the number of outliers removed from this column
        print("Outliers removed from:",column,"--->", outliers_removed)
    
    # Returning the cleaned dataframe and the total number of outliers removed
    return df, total_outliers_removed

# Removing outliers from all numeric columns
tt_cleaned, total_outliers_removed = remove_outliers(tt_cleaned)


Outliers removed from: Survived ---> 0
Outliers removed from: Pclass ---> 0
Outliers removed from: Age ---> 66
Outliers removed from: SibSp ---> 39
Outliers removed from: Parch ---> 144
Outliers removed from: Fare ---> 81


# The code above removes outliers first, because outliers distort the mean
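
One side effect of the IQR rule is worth keeping in mind: when Q1 and Q3 are equal, the IQR is 0 and both fences collapse onto that value, so every other value is flagged as an outlier. That is consistent with the Parch line in the output above, where 144 rows were removed. A minimal sketch on made-up data (`iqr_bounds` is a hypothetical helper, not part of the notebook):

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences for the k * IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up column where most values are 0, similar in spirit to Parch.
counts = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 2])

lower, upper = iqr_bounds(counts)
kept = counts[(counts >= lower) & (counts <= upper)]

print(lower, upper)    # both fences sit on 0 because the IQR is 0
print(kept.unique())   # only the value 0 survives the filter
```

For count-like or categorical-coded columns such as these, it may be preferable to skip the filter altogether; that is a judgment call rather than something the notebook prescribes.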


<span style="text-align: center;color:Red;font-size: 20px;text-align: center;font-weight:bold ; font-size:40px">Task 2: Calculate Summary Statistics</span>

In [18]:
# Summary statistics can be computed with the built-in describe() method

style_df=tt_cleaned.describe().T.style.background_gradient(axis=1,cmap="Spectral")
style_df

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Survived | 561.0 | 0.286988 | 0.452759 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| Pclass | 561.0 | 2.520499 | 0.717155 | 1.0 | 2.0 | 3.0 | 3.0 | 3.0 |
| Age | 561.0 | 29.171123 | 8.463058 | 5.0 | 24.0 | 28.0 | 32.0 | 54.0 |
| SibSp | 561.0 | 0.190731 | 0.440357 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| Parch | 561.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Fare | 561.0 | 13.727918 | 10.560514 | 0.0 | 7.775 | 8.4583 | 15.05 | 53.1 |


In [19]:
# The same statistics can also be computed one metric at a time, as below

# Calculate Summary Statistics
mean_Survived = tt_cleaned['Survived'].mean()
median_Survived = tt_cleaned['Survived'].median()
mode_Survived = tt_cleaned['Survived'].mode()[0]
std_Survived = tt_cleaned['Survived'].std()

mean_Pclass = tt_cleaned['Pclass'].mean()
median_Pclass = tt_cleaned['Pclass'].median()
mode_Pclass = tt_cleaned['Pclass'].mode()[0]
std_Pclass = tt_cleaned['Pclass'].std()

mean_Age = tt_cleaned['Age'].mean()
median_Age = tt_cleaned['Age'].median()
mode_Age = tt_cleaned['Age'].mode()[0]
std_Age = tt_cleaned['Age'].std()

mean_SibSp = tt_cleaned['SibSp'].mean()
median_SibSp = tt_cleaned['SibSp'].median()
mode_SibSp = tt_cleaned['SibSp'].mode()[0]
std_SibSp = tt_cleaned['SibSp'].std()

mean_Parch = tt_cleaned['Parch'].mean()
median_Parch = tt_cleaned['Parch'].median()
mode_Parch = tt_cleaned['Parch'].mode()[0]
std_Parch = tt_cleaned['Parch'].std()

mean_fare = tt_cleaned['Fare'].mean()
median_fare = tt_cleaned['Fare'].median()
mode_fare = tt_cleaned['Fare'].mode()[0]
std_fare = tt_cleaned['Fare'].std()


In [20]:
summary_stats = {
    'Metrics': ['Mean', 'Median', 'Mode', 'Standard Deviation'],
    'Survived': [mean_Survived, median_Survived, mode_Survived, std_Survived],
    'Pclass': [mean_Pclass, median_Pclass, mode_Pclass, std_Pclass],
    'Age': [mean_Age, median_Age, mode_Age, std_Age],
    'SibSp': [mean_SibSp, median_SibSp, mode_SibSp, std_SibSp],
    'Parch': [mean_Parch, median_Parch, mode_Parch, std_Parch],
    'Fare': [mean_fare, median_fare, mode_fare, std_fare]
}

summary_df = pd.DataFrame(summary_stats)

In [21]:
styled_summary_df = summary_df.style.background_gradient(axis=1, cmap="viridis")

In [22]:
styled_summary_df

| | Metrics | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| 0 | Mean | 0.286988 | 2.520499 | 29.171123 | 0.190731 | 0.0 | 13.727918 |
| 1 | Median | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 8.4583 |
| 2 | Mode | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 8.05 |
| 3 | Standard Deviation | 0.452759 | 0.717155 | 8.463058 | 0.440357 | 0.0 | 10.560514 |
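
The per-column variables above can also be collapsed into a few lines. The sketch below is one possible alternative, shown on made-up data (the `demo` frame is hypothetical); it uses `DataFrame.agg` for mean, median, and standard deviation, and adds the mode separately.

```python
import pandas as pd

# Made-up numeric data; in the notebook this would be the numeric
# columns of tt_cleaned.
demo = pd.DataFrame({
    "Age": [20.0, 28.0, 28.0, 40.0],
    "Fare": [7.0, 8.0, 8.0, 30.0],
})

# mean, median and std come from agg; the mode is taken separately because
# DataFrame.mode() returns a frame with one row per tied mode.
stats = demo.agg(["mean", "median", "std"])
stats.loc["mode"] = demo.mode().iloc[0]

print(stats)
```

On the real data, the same calls would be applied to `tt_cleaned.select_dtypes("number")`, since the cleaned frame still contains text columns.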


<span style="text-align: center;color:Red;font-size: 40px;text-align: center;font-weight:bold">Conclusion For Task 2</span><br>
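<span style="text-align: center;color:White;font-size: 20px;text-align: center;font-weight:bold">After outlier removal, 561 of the 891 rows remain. The survival rate among them is about 28.7%, and most passengers travelled in third class (median Pclass 3). Ages range from 5 to 54 with a mean of about 29.2, and fares range from 0 to 53.1 with a median of about 8.46. Few passengers had siblings or spouses aboard (mean SibSp about 0.19), and Parch is 0 for every remaining row because the IQR filter removed all passengers with a nonzero Parch.</span>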