# Lung Cancer Risk Analysis – Data Cleaning & Preparation
*Sujal Adhikari*

In [75]:
### All the necessary libraries that we will be using in the project
import pandas as pd
import numpy as np 


In [76]:
uncleanedData = pd.read_csv('cancer.csv')  # Raw data that still needs to be cleaned and prepared for evaluation
uncleanedData.head(5)

Unnamed: 0,index,Patient Id,Age,Gender,Air Pollution,Alcohol use,Dust Allergy,OccuPational Hazards,Genetic Risk,chronic Lung Disease,...,Fatigue,Weight Loss,Shortness of Breath,Wheezing,Swallowing Difficulty,Clubbing of Finger Nails,Frequent Cold,Dry Cough,Snoring,Level
0,0,P1,33,1,2,4,5,4,3,2,...,3,4,2,2,3,1,2,3,4,Low
1,1,P10,17,1,3,1,5,3,4,2,...,1,3,7,8,6,2,1,7,2,Medium
2,2,P100,35,1,4,5,6,5,5,4,...,8,7,9,2,1,4,6,7,2,High
3,3,P1000,37,1,7,7,7,7,6,7,...,4,2,3,1,4,5,6,7,5,High
4,4,P101,46,1,6,8,7,7,7,6,...,3,2,4,1,4,2,4,2,3,High


In [85]:
print(f"The data has {uncleanedData.shape[0]} rows and {uncleanedData.shape[1]} columns")

The data has 1000 rows and 26 columns


In [77]:
### Let's look at the columns and their data types
print(uncleanedData.dtypes)

index                        int64
Patient Id                  object
Age                          int64
Gender                       int64
Air Pollution                int64
Alcohol use                  int64
Dust Allergy                 int64
OccuPational Hazards         int64
Genetic Risk                 int64
chronic Lung Disease         int64
Balanced Diet                int64
Obesity                      int64
Smoking                      int64
Passive Smoker               int64
Chest Pain                   int64
Coughing of Blood            int64
Fatigue                      int64
Weight Loss                  int64
Shortness of Breath          int64
Wheezing                     int64
Swallowing Difficulty        int64
Clubbing of Finger Nails     int64
Frequent Cold                int64
Dry Cough                    int64
Snoring                      int64
Level                       object
dtype: object


### The dtypes show that every column except 'Patient Id' and 'Level' is stored as int64, so the data is already in suitable data types and needs no conversion
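The same conclusion can be checked programmatically rather than by reading the printed list. The snippet below is a minimal sketch on a small toy frame (the values are made up; only the column names mirror the dataset): it lists the non-numeric columns and the distinct labels in the target column.

```python
import pandas as pd

# Toy frame standing in for the real data; values are illustrative only.
toy = pd.DataFrame({
    "Patient Id": ["P1", "P10", "P100"],
    "Age": [33, 17, 35],
    "Smoking": [3, 2, 2],
    "Level": ["Low", "Medium", "High"],
})

# Columns that are not numeric (expected: the identifier and the target label).
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# Distinct labels in the target column, sorted for a stable display.
level_labels = sorted(toy["Level"].unique())
print(level_labels)
```

On the real data, replacing `toy` with `uncleanedData` should list only 'Patient Id' and 'Level' if the dtypes shown above hold.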

---


### Duplicated data should be detected and removed to avoid false insights later during EDA

In [78]:
### Let's check whether there are any duplicated values
uniquePatients = uncleanedData['Patient Id'].nunique()
totalPatients = len(uncleanedData['Patient Id'])

print(uniquePatients == totalPatients)
### True means every Patient Id is unique, so no patient's record is repeated, which is good for the analysis

True
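The check above compares identifiers only. As a complementary sketch, rows can also be compared on their measurements while ignoring the identifier, using `DataFrame.duplicated`. The toy frame below is made up for illustration; whether identical measurements for two different patients count as a problem is a judgment call, so this is a flag to inspect rather than proof of an error.

```python
import pandas as pd

# Toy data: all IDs differ, but P1 and P3 carry identical measurements.
toy = pd.DataFrame({
    "Patient Id": ["P1", "P2", "P3"],
    "Age": [33, 17, 33],
    "Smoking": [3, 2, 3],
    "Level": ["Low", "Medium", "Low"],
})

# The ID-based check passes because every identifier is distinct.
ids_unique = toy["Patient Id"].is_unique

# Compare rows on every column except the identifier.
feature_cols = toy.columns.drop("Patient Id").tolist()
repeated = toy.duplicated(subset=feature_cols, keep=False)
n_repeated = int(repeated.sum())

print(ids_unique, n_repeated)
```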


---

### The 'Gender' column is recoded from 1 and 2 to 'Male' and 'Female' for readability, and a new column named 'GenderBinary' (Male = 1, Female = 0) is added for use with a machine learning model later on

In [79]:
uncleanedData['Gender'] = uncleanedData['Gender'].map({1:'Male',2:'Female'})
uncleanedData['GenderBinary'] = uncleanedData['Gender'].map({'Male':1,'Female':0})
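One thing to keep in mind with `Series.map` and a dictionary: any value that is not a key in the dictionary becomes NaN without raising an error. The sketch below uses a made-up series with one deliberately invalid code to show how such values can be surfaced.

```python
import pandas as pd

# Illustrative codes; 3 is a deliberately invalid value.
gender_codes = pd.Series([1, 2, 1, 3])

labels = gender_codes.map({1: "Male", 2: "Female"})

# Values absent from the mapping come back as NaN, so isna() locates them.
unmapped = gender_codes[labels.isna()].tolist()
print(unmapped)
```

In this notebook the missing-value check runs after the mapping, so it would also catch any code that the dictionary did not cover.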

---
### Let's check for missing values and handle them!

In [80]:
### Missing Values Check!
if uncleanedData.isna().sum().sum() == 0:
    print("There are no missing values")
else:
    print('There are missing values')


There are no missing values
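When the check does report missing values, a per-column count shows where they are. The following is a small sketch on made-up data with two gaps inserted on purpose; it is not output from the cancer dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with two missing entries added for illustration.
toy = pd.DataFrame({
    "Age": [33, np.nan, 35],
    "Smoking": [3, 2, np.nan],
    "Level": ["Low", "Medium", "High"],
})

# Count missing entries per column.
missing_per_column = toy.isna().sum()

# Keep only the columns that have at least one missing entry.
columns_with_gaps = missing_per_column[missing_per_column > 0]
print(columns_with_gaps)
```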


In [81]:
### 'Patient Id' is only an identifier, so it is safe to drop. Note that this cell removes 'Patient Id' only; the 'index' column is still present in the output below
uncleanedData = uncleanedData.drop(columns=['Patient Id'])
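The cell above removes only 'Patient Id'. If the 'index' column should go as well, both labels can be passed in one call. This is a sketch on a toy frame, not a change to the notebook's result; `errors="ignore"` is an optional choice that lets the cell be rerun without raising once the columns are already gone.

```python
import pandas as pd

# Toy frame with the two bookkeeping columns plus one feature.
toy = pd.DataFrame({
    "index": [0, 1, 2],
    "Patient Id": ["P1", "P10", "P100"],
    "Age": [33, 17, 35],
})

# errors="ignore" skips labels that are not present instead of raising KeyError.
trimmed = toy.drop(columns=["index", "Patient Id"], errors="ignore")
remaining = trimmed.columns.tolist()
print(remaining)

# Running the same drop again is harmless with errors="ignore".
trimmed_again = trimmed.drop(columns=["index", "Patient Id"], errors="ignore")
```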

In [82]:
uncleanedData.head(5)

Unnamed: 0,index,Age,Gender,Air Pollution,Alcohol use,Dust Allergy,OccuPational Hazards,Genetic Risk,chronic Lung Disease,Balanced Diet,...,Weight Loss,Shortness of Breath,Wheezing,Swallowing Difficulty,Clubbing of Finger Nails,Frequent Cold,Dry Cough,Snoring,Level,GenderBinary
0,0,33,Male,2,4,5,4,3,2,2,...,4,2,2,3,1,2,3,4,Low,1
1,1,17,Male,3,1,5,3,4,2,2,...,3,7,8,6,2,1,7,2,Medium,1
2,2,35,Male,4,5,6,5,5,4,6,...,7,9,2,1,4,6,7,2,High,1
3,3,37,Male,7,7,7,7,6,7,7,...,2,3,1,4,5,6,7,5,High,1
4,4,46,Male,6,8,7,7,7,6,7,...,2,4,1,4,2,4,2,3,High,1


### The raw data has been cleaned and is ready for further analysis. The cleaned DataFrame (still held in the variable 'uncleanedData') is saved as a new CSV named 'cleaned.csv'

In [83]:
uncleanedData.to_csv('cleaned.csv')
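By default `to_csv` also writes the DataFrame's row index as an extra, unnamed first column, which `read_csv` then loads back as 'Unnamed: 0'. The sketch below demonstrates the difference with `index=False` on a toy frame, writing to in-memory buffers instead of files so nothing on disk is touched.

```python
import io
import pandas as pd

toy = pd.DataFrame({"Age": [33, 17], "Level": ["Low", "Medium"]})

# Default behaviour: the row index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False writes only the actual columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

Whether to pass `index=False` when saving 'cleaned.csv' depends on how the next notebook reads the file; it matters mainly if the extra column would otherwise need to be dropped again.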

# The dataset is now clean, with no missing values, no duplicate patient IDs, and the 'Patient Id' column removed. We’ve saved the cleaned version as cleaned.csv, ready to explore some insights in the next step!