<a href="https://colab.research.google.com/github/simonagyosheva/Bioinforamatics-practice/blob/main/03_data_cleaning_patient_data_groups.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [7]:
# Goal: practise loading and inspecting data, finding missing values, normalising values, and exporting the result.
# Step 1: Import required libraries
import pandas as pd

# Step 2: Load the Titanic sample dataset from the seaborn-data repository
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"
df = pd.read_csv(url)

# Step 3: Preview the data
print("First 5 rows of the dataset:")
print(df.head())

print("\nDataset shape (rows, columns):", df.shape)

# Step 4: Check for missing data
print("\nMissing values per column:")
print(df.isnull().sum())

# Step 5: Drop columns with too many missing values
df = df.drop(columns=['deck'])  # most values in this column are missing

# Step 6: Fill missing values in 'age' with the median
df['age'] = df['age'].fillna(df['age'].median())

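# Step 7: Fill missing 'embark_town' and its abbreviation 'embarked' with the most common value
df['embark_town'] = df['embark_town'].fillna(df['embark_town'].mode()[0])
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])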

# Step 8: Remove duplicate rows (if any)
df = df.drop_duplicates()

# Step 9: Convert 'sex' to lowercase (normalize)
df['sex'] = df['sex'].str.lower()

# Step 10: Show cleaned dataset info
print("\nCleaned data summary:")
df.info()  # info() prints its summary itself and returns None, so no print() is needed

# Step 11: Save to CSV (optional)
df.to_csv("cleaned_titanic_data.csv", index=False)

# Group by sex, calculate average age
print("Average age by sex:")
print(df.groupby('sex')['age'].mean())

# Group by sex, count survivors
print("\nNumber of survivors by sex:")
print(df.groupby('sex')['survived'].sum())

# Group by class and sex, show average age and survival
print("\nAverage age and survival by class and sex:")
print(df.groupby(['class', 'sex'])[['age', 'survived']].mean())
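
The separate grouped summaries above could also be collected into one table with pandas named aggregation. The following is a minimal sketch, not part of the original notebook: it uses a tiny made-up DataFrame named `sample` (hypothetical values, not the Titanic data) so that it runs without a network connection, and the output column names `mean_age`, `survivors` and `survival_rate` are illustrative choices.

```python
import pandas as pd

# Tiny made-up sample that reuses three column names from the Titanic data.
# The values are hypothetical and only serve to illustrate the technique.
sample = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "age": [22.0, 38.0, 26.0, 35.0],
    "survived": [0, 1, 1, 0],
})

# Named aggregation: each keyword becomes an output column,
# defined as a (source column, aggregation function) pair.
summary = sample.groupby("sex").agg(
    mean_age=("age", "mean"),            # average age per group
    survivors=("survived", "sum"),       # summing 0/1 flags counts survivors
    survival_rate=("survived", "mean"),  # the mean of 0/1 flags is a proportion
)
print(summary)
```

Under these assumptions, the same `agg` call should work on the cleaned `df` by replacing `sample` with `df`, and grouping by `['class', 'sex']` would give one row per class and sex combination.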



First 5 rows of the dataset:
   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
0         0       3    male  22.0      1      0   7.2500        S  Third   
1         1       1  female  38.0      1      0  71.2833        C  First   
2         1       3  female  26.0      0      0   7.9250        S  Third   
3         1       1  female  35.0      1      0  53.1000        S  First   
4         0       3    male  35.0      0      0   8.0500        S  Third   

     who  adult_male deck  embark_town alive  alone  
0    man        True  NaN  Southampton    no  False  
1  woman       False    C    Cherbourg   yes  False  
2  woman       False  NaN  Southampton   yes   True  
3  woman       False    C  Southampton   yes  False  
4    man        True  NaN  Southampton    no   True  

Dataset shape (rows, columns): (891, 15)

Missing values per column:
survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare  