In [None]:
Project Title:
Infant Mortality Rate Analysis in Pakistan (1970–2020)
                                            
📁 Dataset Source:
The dataset contains various health and demographic indicators for Pakistan, including the infant mortality rate from 1970 to 2020.
The indicator codes and URLs in the file point to the WHO Global Health Observatory (GHO); the data is provided in CSV format.

📊 Initial Columns Overview:
| Column Name   | Description                                                  |
| ------------- | ------------------------------------------------------------ |
| Year          | The year in which the data was recorded                      |
| Sex           | Gender category: 'Male', 'Female', 'Both sexes', etc.        |
| Value         | Infant mortality rate (per 1000 live births)                 |
| Other columns | Various irrelevant dimensions (age groups, education, etc.)  |

Initial Observations:
======================
The dataset contained many non-numeric or irrelevant entries in Sex, Year, and other columns.
The Value column, which represented the target variable, had some missing/null values.
The Sex column contained unwanted categories like "Unknown", "Total", and others unrelated to gender.

🧹 Data Cleaning Steps:
============================
1. Dropped irrelevant columns: columns unrelated to the analysis objective were removed (e.g., Age, Education).
2. Dropped missing or invalid value entries: all rows with missing values in the target variable (Value) were removed.
3. Filtered relevant Sex categories: retained only rows where Sex is 'Male', 'Female', or 'Both sexes'. All other values, such as 'Unknown', 'Total', or 'Q2', were removed.
4. Filtered valid years: retained rows where Year is a valid 4-digit number (e.g., 1970, 1985).
5. Label encoding: the categorical Sex column was label encoded as Male = 0, Female = 1, Both sexes = 2.
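
The cleaning steps listed above can be sketched on a few toy rows. This is a minimal illustration, not the notebook's actual pipeline: the frame `df`, its values, and the `valid_sex` list are made up for the example, and the column names follow the overview table rather than the raw file.

```python
import pandas as pd

# Toy rows standing in for the real dataset (values are illustrative only)
df = pd.DataFrame({
    "Year": ["1970", "1985", "#date+year", "2000", "1990"],
    "Sex": ["Male", "Female", "Both sexes", "Unknown", "Both sexes"],
    "Value": [1.0, 2.0, 3.0, 4.0, None],
})

valid_sex = ["Male", "Female", "Both sexes"]

# Step 2: drop rows with a missing target value
df = df.dropna(subset=["Value"])
# Step 3: keep only the three sex categories of interest
df = df[df["Sex"].isin(valid_sex)]
# Step 4: keep only rows whose Year is exactly four digits
df = df[df["Year"].astype(str).str.fullmatch(r"\d{4}")].copy()
# Step 5: label encode Sex and store Year as an integer
df["Sex"] = df["Sex"].map({"Male": 0, "Female": 1, "Both sexes": 2})
df["Year"] = df["Year"].astype(int)

print(df)
```

Filtering before encoding means `map` never meets an unlisted label, so no NaN values are introduced at the encoding step.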

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Path to the CSV file (replace with the actual location on your machine)
csv_path = "F:\\Internships_Works\\health_indicators_pak.csv"

# Read the CSV file into a pandas DataFrame
data = pd.read_csv(csv_path)
# Display the first few rows of the DataFrame
data.head()

Unnamed: 0,GHO (CODE),GHO (DISPLAY),GHO (URL),YEAR (DISPLAY),STARTYEAR,ENDYEAR,REGION (CODE),REGION (DISPLAY),COUNTRY (CODE),COUNTRY (DISPLAY),DIMENSION (TYPE),DIMENSION (CODE),DIMENSION (NAME),Numeric,Value,Low,High
0,#indicator+code,#indicator+name,#indicator+url,#date+year,#date+year+start,#date+year+end,#region+code,#region+name,#country+code,#country+name,#dimension+type,#dimension+code,#dimension+name,#indicator+value+num,#indicator+value,#indicator+value+low,#indicator+value+high
1,MDG_0000000001,Infant mortality rate (probability of dying be...,https://www.who.int/data/gho/data/indicators/i...,1978,1978,1978,EMR,Eastern Mediterranean,PAK,Pakistan,SEX,SEX_BTSX,Both sexes,128.629383203,128.6 [121.3-136.4],121.298608474,136.43451409
2,CM_02,Number of infant deaths,https://www.who.int/data/gho/data/indicators/i...,1970,1970,1970,EMR,Eastern Mediterranean,PAK,Pakistan,SEX,SEX_FMLE,Female,162243.0,162 243 [150 180-175 188],150180.0,175188.0
3,CM_02,Number of infant deaths,https://www.who.int/data/gho/data/indicators/i...,1968,1968,1968,EMR,Eastern Mediterranean,PAK,Pakistan,SEX,SEX_BTSX,Both sexes,341398.0,341 398 [316 125-367 516],316125.0,367516.0
4,CM_01,Number of under-five deaths,https://www.who.int/data/gho/data/indicators/i...,1994,1994,1994,EMR,Eastern Mediterranean,PAK,Pakistan,SEX,SEX_BTSX,Both sexes,159104.0,159 104 [142 595-177 120],142595.0,177120.0
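
Row 0 in the output above holds tag labels (`#indicator+code`, `#date+year`, ...) rather than measurements, which is also why every column is later reported with dtype `object`. One possible way to avoid this, sketched here on a tiny inline CSV rather than the real file, is to skip that line when reading. The inline text and the name `raw` are assumptions made for the example.

```python
import io
import pandas as pd

# Tiny stand-in for the real file: header, tag row, then two data rows
raw = io.StringIO(
    "GHO (CODE),YEAR (DISPLAY),Numeric\n"
    "#indicator+code,#date+year,#indicator+value+num\n"
    "MDG_0000000001,1978,128.629383203\n"
    "CM_02,1970,162243.0\n"
)

# skiprows=[1] skips the second line of the file (the tag row);
# line 0 is still used as the header
df = pd.read_csv(raw, skiprows=[1])

print(df.dtypes)
```

With the tag row gone, pandas can infer numeric dtypes for the year and value columns instead of falling back to strings.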


In [3]:
# Display basic info
print("=== Dataset Info ===")
data.info()  # info() prints its report directly and returns None
print("\n=== First 5 Rows ===")
print(data.head())

=== Dataset Info ===
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22418 entries, 0 to 22417
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   GHO (CODE)         22418 non-null  object
 1   GHO (DISPLAY)      22418 non-null  object
 2   GHO (URL)          22418 non-null  object
 3   YEAR (DISPLAY)     22418 non-null  object
 4   STARTYEAR          22418 non-null  object
 5   ENDYEAR            22418 non-null  object
 6   REGION (CODE)      22418 non-null  object
 7   REGION (DISPLAY)   22418 non-null  object
 8   COUNTRY (CODE)     22418 non-null  object
 9   COUNTRY (DISPLAY)  22418 non-null  object
 10  DIMENSION (TYPE)   19009 non-null  object
 11  DIMENSION (CODE)   19009 non-null  object
 12  DIMENSION (NAME)   18965 non-null  object
 13  Numeric            20226 non-null  object
 14  Value              22349 non-null  object
 15  Low                14441 non-null  object
 16  High               14441 non-null  object

In [4]:
# Check for null/missing values
print("\n=== Null Values ===")
print(data.isnull().sum())


=== Null Values ===
GHO (CODE)              0
GHO (DISPLAY)           0
GHO (URL)               0
YEAR (DISPLAY)          0
STARTYEAR               0
ENDYEAR                 0
REGION (CODE)           0
REGION (DISPLAY)        0
COUNTRY (CODE)          0
COUNTRY (DISPLAY)       0
DIMENSION (TYPE)     3409
DIMENSION (CODE)     3409
DIMENSION (NAME)     3453
Numeric              2192
Value                  69
Low                  7977
High                 7977
dtype: int64


We dropped rows where 'Value' or 'Numeric' was missing because these columns contain the main numerical data we are analyzing. A row without these values cannot be used for any meaningful analysis, visualization, or modeling, and keeping it would cause errors or incorrect results in graphs and summaries. Removing such rows is a standard data cleaning step that protects data quality.


In [5]:
# Drop rows where 'Value' or 'Numeric' is missing
data.dropna(subset=['Value', 'Numeric'], inplace=True)


The dimension columns provide extra information such as gender or age group. Dropping rows where this information is missing would discard valid data points just because a small piece of extra information was absent. Instead, we fill the missing dimension values with the neutral label 'Unknown', which preserves the useful data while clearly marking that the dimension information is not available.

In [6]:
# Fill nulls in dimension columns with 'Unknown' (optional)
data['DIMENSION (TYPE)'] = data['DIMENSION (TYPE)'].fillna('Unknown')
data['DIMENSION (CODE)'] = data['DIMENSION (CODE)'].fillna('Unknown')
data['DIMENSION (NAME)'] = data['DIMENSION (NAME)'].fillna('Unknown')

The 'Low' and 'High' columns represent the confidence interval range for the main value. A large share of their values were missing (7,977 nulls each in the raw data, about 36% of rows), and since we are not analyzing statistical uncertainty or doing advanced modeling, these columns are not needed. We dropped them to simplify the dataset.

In [7]:
# Drop 'Low' and 'High' columns if not needed
data.drop(columns=['Low', 'High'], inplace=True)

In [8]:
# Reset index after dropping
data.reset_index(drop=True, inplace=True)

In [9]:
# Final check
print("\n=== After Cleaning Nulls ===")
print(data.isnull().sum())
print("\nCleaned Data Sample:")
data.head()


=== After Cleaning Nulls ===
GHO (CODE)           0
GHO (DISPLAY)        0
GHO (URL)            0
YEAR (DISPLAY)       0
STARTYEAR            0
ENDYEAR              0
REGION (CODE)        0
REGION (DISPLAY)     0
COUNTRY (CODE)       0
COUNTRY (DISPLAY)    0
DIMENSION (TYPE)     0
DIMENSION (CODE)     0
DIMENSION (NAME)     0
Numeric              0
Value                0
dtype: int64

Cleaned Data Sample:


Unnamed: 0,GHO (CODE),GHO (DISPLAY),GHO (URL),YEAR (DISPLAY),STARTYEAR,ENDYEAR,REGION (CODE),REGION (DISPLAY),COUNTRY (CODE),COUNTRY (DISPLAY),DIMENSION (TYPE),DIMENSION (CODE),DIMENSION (NAME),Numeric,Value
0,#indicator+code,#indicator+name,#indicator+url,#date+year,#date+year+start,#date+year+end,#region+code,#region+name,#country+code,#country+name,#dimension+type,#dimension+code,#dimension+name,#indicator+value+num,#indicator+value
1,MDG_0000000001,Infant mortality rate (probability of dying be...,https://www.who.int/data/gho/data/indicators/i...,1978,1978,1978,EMR,Eastern Mediterranean,PAK,Pakistan,SEX,SEX_BTSX,Both sexes,128.629383203,128.6 [121.3-136.4]
2,CM_02,Number of infant deaths,https://www.who.int/data/gho/data/indicators/i...,1970,1970,1970,EMR,Eastern Mediterranean,PAK,Pakistan,SEX,SEX_FMLE,Female,162243.0,162 243 [150 180-175 188]
3,CM_02,Number of infant deaths,https://www.who.int/data/gho/data/indicators/i...,1968,1968,1968,EMR,Eastern Mediterranean,PAK,Pakistan,SEX,SEX_BTSX,Both sexes,341398.0,341 398 [316 125-367 516]
4,CM_01,Number of under-five deaths,https://www.who.int/data/gho/data/indicators/i...,1994,1994,1994,EMR,Eastern Mediterranean,PAK,Pakistan,SEX,SEX_BTSX,Both sexes,159104.0,159 104 [142 595-177 120]


| Column Name                          | Action      | Reason                                                                                                                                             |
| ------------------------------------ | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------- |
| `GHO (CODE)`                         | ❌ **Drop**  | WHO internal indicator code; not needed in the analysis itself.                                                                                    |
| `GHO (DISPLAY)`                      | ❌ **Drop**  | Indicator name. It is constant only after filtering to "Infant mortality rate (...)"; the sample above also contains "Number of infant deaths" and "Number of under-five deaths", so filter to the infant mortality indicator first. |
| `GHO (URL)`                          | ❌ **Drop**  | Only a reference link; not used in a model or graph.                                                                                               |
| `STARTYEAR` / `ENDYEAR`              | ❌ **Drop**  | These duplicate `YEAR (DISPLAY)`; each row's start and end year should be the same year.                                                           |
| `REGION (CODE)` / `REGION (DISPLAY)` | ❌ **Drop**  | Every row has the same region: "Eastern Mediterranean" (Pakistan's WHO region). No variation.                                                      |
| `COUNTRY (CODE)`                     | ❌ **Drop**  | Every row should have the same code (PAK); `COUNTRY (DISPLAY)` already carries this information.                                                   |
| `COUNTRY (DISPLAY)`                  | ⚠️ Optional | Useful if there were multiple countries. With only Pakistan, it can be dropped.                                                                    |
| `DIMENSION (TYPE)`                   | ❌ **Drop**  | Its value is mostly the same (e.g. `SEX`) and is already implied by `DIMENSION (NAME)`.                                                            |
| `DIMENSION (CODE)`                   | ❌ **Drop**  | Code for the sex category (e.g. `SEX_BTSX`); `DIMENSION (NAME)` is human-readable and better for charts.                                           |
| `Value`                              | ❌ **Drop**  | String version of `Numeric` (the value together with its interval). `Numeric` is already used for the analysis.                                    |
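
Because the cleaned sample above mixes several indicators (`MDG_0000000001`, `CM_02`, `CM_01`), the indicator columns are only safe to drop once the rows have been filtered to the infant mortality rate. The sketch below shows one way to do that on toy rows modeled on the sample. It assumes that `MDG_0000000001` is the only code for the infant mortality rate in the full file, which should be checked against the data; the names `sample`, `IMR_CODE`, and `imr` are made up for the example.

```python
import pandas as pd

# Toy rows mirroring the sample above (numbers rounded, for illustration only)
sample = pd.DataFrame({
    "GHO (CODE)": ["MDG_0000000001", "CM_02", "CM_02", "CM_01"],
    "YEAR (DISPLAY)": [1978, 1970, 1968, 1994],
    "DIMENSION (TYPE)": ["SEX", "SEX", "SEX", "SEX"],
    "DIMENSION (NAME)": ["Both sexes", "Female", "Both sexes", "Both sexes"],
    "Numeric": [128.6, 162243.0, 341398.0, 159104.0],
})

# Assumption: this code identifies the infant mortality rate indicator
IMR_CODE = "MDG_0000000001"

# Keep only infant mortality rate rows that are broken down by sex
mask = (sample["GHO (CODE)"] == IMR_CODE) & (sample["DIMENSION (TYPE)"] == "SEX")
imr = sample.loc[mask, ["YEAR (DISPLAY)", "DIMENSION (NAME)", "Numeric"]]

print(imr)
```

After this filter every remaining `Numeric` value is a rate per 1000 live births, so the target column has a single unit.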


In [10]:
# (Optional) Rename columns to more user-friendly names
data.rename(columns={
    'YEAR (DISPLAY)': 'Year',
    'DIMENSION (NAME)': 'Sex',
    'Numeric': 'Infant_Mortality'
}, inplace=True)
data.columns  # Display updated column names

Index(['GHO (CODE)', 'GHO (DISPLAY)', 'GHO (URL)', 'Year', 'STARTYEAR',
       'ENDYEAR', 'REGION (CODE)', 'REGION (DISPLAY)', 'COUNTRY (CODE)',
       'COUNTRY (DISPLAY)', 'DIMENSION (TYPE)', 'DIMENSION (CODE)', 'Sex',
       'Infant_Mortality', 'Value'],
      dtype='object')

In [15]:
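# Keep only relevant columns (copy so the later assignment does not trigger SettingWithCopyWarning)
data = data[['Year', 'Sex', 'Infant_Mortality']].copy()

# Display the cleaned dataset
print(data.head())

# Encode 'Sex' (Male=0, Female=1, Both sexes=2); any other label becomes NaN
data['Sex'] = data['Sex'].map({'Male': 0, 'Female': 1, 'Both sexes': 2})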


         Year              Sex      Infant_Mortality
0  #date+year  #dimension+name  #indicator+value+num
1        1978       Both sexes         128.629383203
2        1970           Female              162243.0
3        1968       Both sexes              341398.0
4        1994       Both sexes              159104.0
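
The printed frame still contains the tag row at index 0, and both `Year` and `Infant_Mortality` are stored as strings. A minimal sketch of converting them to numbers is shown below on toy rows copied from the output above; the name `frame` is an assumption for the example, and this step is not part of the original notebook.

```python
import pandas as pd

# Toy rows copied from the printed output above
frame = pd.DataFrame({
    "Year": ["#date+year", "1978", "1970"],
    "Infant_Mortality": ["#indicator+value+num", "128.629383203", "162243.0"],
})

# Convert to numbers; anything that is not numeric (the tag row) becomes NaN
for col in ["Year", "Infant_Mortality"]:
    frame[col] = pd.to_numeric(frame[col], errors="coerce")

# Drop the rows that failed to convert and store Year as an integer
frame = frame.dropna(subset=["Year", "Infant_Mortality"]).reset_index(drop=True)
frame["Year"] = frame["Year"].astype(int)

print(frame.dtypes)
```

Using `errors="coerce"` turns unparseable entries into NaN instead of raising, so they can be removed with a single `dropna`.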


In [23]:
# Dependent variable
y = data['Infant_Mortality']

# Independent variables
x = data[['Year', 'Sex']]


print(y.isnull().sum())

0


In [25]:
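# Drop rows with missing features (e.g. 'Sex' labels that were not mapped)
x = x.dropna()
y = y.loc[x.index]  # Keep target aligned

In [26]:
# No-op after the dropna() above: no missing values remain to fill
x = x.fillna(x.mean(numeric_only=True))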

In [27]:
print(x.isnull().sum())

Year    0
Sex     0
dtype: int64


In [28]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Train-test split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

In [32]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
# Train Random Forest model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(x_train, y_train)

# Predict
y_pred = model.predict(x_test)


In [34]:
# Evaluate
print("MSE:", mean_squared_error(y_test, y_pred))
print("R² Score:", r2_score(y_test, y_pred))

MSE: 269841502701.63858
R² Score: 0.01481736590056315
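
To help read these two numbers, the sketch below computes MSE and R² from their definitions and takes the square root of the reported MSE. The resulting RMSE is roughly 519,000 in the units of the target, far larger than any rate per 1000 live births, which suggests the target still mixes death counts (such as 162243.0 in the sample) with rates. The helper `mse_and_r2` and the toy values are assumptions for illustration.

```python
import math
import numpy as np

def mse_and_r2(y_true, y_pred):
    """Compute MSE and R² directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return ss_res / len(y_true), 1.0 - ss_res / ss_tot

# RMSE of the reported result, in the same units as the target
reported_mse = 269841502701.63858
rmse = math.sqrt(reported_mse)
print(f"RMSE: {rmse:,.0f}")

# Toy check: always predicting the mean gives R² = 0
y_true = [1.0, 2.0, 3.0, 4.0]
mse, r2 = mse_and_r2(y_true, [2.5] * 4)
print(mse, r2)
```

An R² close to 0, like the reported 0.0148, therefore means the model does little better than predicting the mean of the target.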


In [35]:
import xgboost as xgb
# 6. Build and train the XGBoost model
model = xgb.XGBRegressor(
    objective='reg:squarederror',  # Squared-error objective for regression
    n_estimators=100,              # Number of trees
    learning_rate=0.1,             # Learning rate
    max_depth=3,                   # Maximum tree depth
    random_state=42
)
model.fit(x_train, y_train)

ModuleNotFoundError: No module named 'xgboost'
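
The error above means the `xgboost` package is not installed in this environment; installing it (for example with `pip install xgboost`) resolves it. As a hedged alternative, the sketch below checks for the package first and otherwise falls back to scikit-learn's `GradientBoostingRegressor`. That fallback is a different gradient boosting implementation, not XGBoost itself, and the helper name `make_boosted_regressor` is hypothetical.

```python
import importlib.util

def make_boosted_regressor(random_state=42):
    """Return an XGBoost regressor if installed, else a scikit-learn stand-in."""
    params = dict(n_estimators=100, learning_rate=0.1, max_depth=3,
                  random_state=random_state)
    if importlib.util.find_spec("xgboost") is not None:
        import xgboost as xgb
        return xgb.XGBRegressor(objective="reg:squarederror", **params)
    # Fallback: a different gradient boosting implementation from scikit-learn
    from sklearn.ensemble import GradientBoostingRegressor
    return GradientBoostingRegressor(**params)

model = make_boosted_regressor()
print(type(model).__name__)
```

Either object exposes `fit` and `predict`, so the training and evaluation cells can stay unchanged.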

