## Enhancing Hospital Efficiency with ML: Data Cleaning, XGBoost, and Predictive Modeling.
In the ever-evolving landscape of healthcare, the quest to enhance patient care outcomes and elevate the quality of healthcare services is an ongoing mission. Healthcare organizations are navigating a complex web of challenges, but within these challenges lies a remarkable opportunity: the power of data.

Healthcare analytics is the compass that guides organizations towards this opportunity. It's the art of dissecting and interpreting data, employing both quantitative and qualitative techniques to unveil the hidden gems of insights and patterns within the wealth of healthcare information. Among the multitude of metrics used for performance evaluation, one vital indicator stands out - the Length of Stay (LOS) for patients.

Predicting a patient's Length of Stay is akin to holding a key that unlocks a world of possibilities. It helps hospitals plan treatment and allocate resources more precisely, which can shorten stays and, in turn, lower the risk of infection among patients, staff, and visitors. In essence, it's a pathway to not just improving patient care but revolutionizing healthcare management as a whole.

In this journey towards better healthcare, you play a pivotal role. You are the data virtuoso, armed with the latest tools and techniques in healthcare analytics. Your mission is to transform raw data into meaningful insights that illuminate the intricate web of patient care. Through the lens of data analysis, you decode the mysteries of patient LOS, revealing trends and patterns that hold the key to more efficient and effective healthcare delivery.

Collaborating closely with the healthcare team, you craft compelling data visualizations that bring these insights to life. Your data-driven creations become the guiding stars, steering healthcare professionals towards better decision-making, enhanced care, and safer environments. While the intricacies of your work may often go unnoticed, its impact reverberates throughout the healthcare organization.

In the world of healthcare analytics, you are the unsung hero, the one who helps unveil the extraordinary stories of improved patient care and streamlined healthcare management. Your dedication to data and your ability to transform it into illuminating insights contribute to the ongoing saga of healthcare excellence, making every patient's journey towards better health that much more extraordinary.

## Module 1
### Task 1: Analyzing the train data.
In a workspace, a data scientist opened "train.csv" with a distinct sense of purpose. This dataset, an intricate tapestry of numbers and records, held the potential to unveil a concealed revelation. The task at hand was far more than routine; it stood as a pivotal endeavor to uncover the essential component in a pioneering scientific inquiry. Within each row and column lay the capacity to redefine an entire domain of knowledge. With each scrutinized line, the motivation to unearth the elusive truth intensified, as the underlying reason for this undertaking gradually unveiled itself.

In [1]:
#--- Import Pandas ---
import pandas as pd
#--- Read in dataset ----
train = pd.read_csv("train.csv")

# ---WRITE YOUR CODE FOR TASK 1 ---
#--- Inspect data ---
train.head()

Unnamed: 0,case_id,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Department,Ward_Type,Ward_Facility_Code,Bed Grade,patientid,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit,Stay
0,1,8,c,3,Z,3,radiotherapy,R,F,2.0,31397,7.0,Emergency,Extreme,2,51-60,4911.0,0-10
1,2,2,c,5,Z,2,radiotherapy,S,F,2.0,31397,7.0,Trauma,Extreme,2,51-60,5954.0,41-50
2,3,10,e,1,X,2,anesthesia,S,E,2.0,31397,7.0,Trauma,Extreme,2,51-60,4745.0,31-40
3,4,26,b,2,Y,2,radiotherapy,R,D,2.0,31397,7.0,Trauma,Extreme,2,51-60,7272.0,41-50
4,5,26,b,2,Y,2,radiotherapy,S,D,2.0,31397,7.0,Trauma,Extreme,2,51-60,5558.0,41-50


### Task 2: Decoding the test data.
In the dim glow of the computer screen, the data scientist continued the intricate puzzle of data analysis. Having meticulously deciphered "train.csv," the next chapter had arrived: "test.csv." The two datasets were like twin siblings, sharing almost identical traits, but for one crucial distinction - a lone, enigmatic column to be predicted.

In [2]:
#--- Read in dataset ----
test = pd.read_csv("test.csv")

# ---WRITE YOUR CODE FOR TASK 2 ---
#--- Inspect data ---
test.head()

Unnamed: 0,case_id,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Department,Ward_Type,Ward_Facility_Code,Bed Grade,patientid,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit
0,318439,21,c,3,Z,3,gynecology,S,A,2.0,17006,2.0,Emergency,Moderate,2,71-80,3095.0
1,318440,29,a,4,X,2,gynecology,S,F,2.0,17006,2.0,Trauma,Moderate,4,71-80,4018.0
2,318441,26,b,2,Y,3,gynecology,Q,D,4.0,17006,2.0,Emergency,Moderate,3,71-80,4492.0
3,318442,6,a,6,X,3,gynecology,Q,F,2.0,17006,2.0,Trauma,Moderate,3,71-80,4173.0
4,318443,28,b,11,X,2,gynecology,R,F,2.0,17006,2.0,Trauma,Moderate,4,71-80,4161.0


### Task 3: Navigating the Missing Values.
As the data scientist delved into the dataset, a new challenge emerged—missing values. The code snippet "null_values_train" unveiled the extent of these gaps within "train.csv." Each missing value represented a hidden aspect of the data that begged to be uncovered.

This task was more than code; it was a quest for the data scientist to unearth hidden knowledge. The null values were the enigmatic pieces in a grand puzzle, vital to the project's success. The data scientist was resolute, devising strategies to address these gaps and ensure the dataset's completeness. With every null value tackled, the path to insights became clearer, one step closer to the ultimate goal.

In [3]:
# --- WRITE YOUR CODE FOR TASK 3 ---
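# Indexing with the boolean mask keeps the frame's shape: non-null cells are
# masked to NaN and null cells stay NaN, so every cell in the result shows NaN
null_values_train = train[train.isnull()]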

#--- Inspect data ---
null_values_train

Unnamed: 0,case_id,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Department,Ward_Type,Ward_Facility_Code,Bed Grade,patientid,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit,Stay
0,,,,,,,,,,,,,,,,,,
1,,,,,,,,,,,,,,,,,,
2,,,,,,,,,,,,,,,,,,
3,,,,,,,,,,,,,,,,,,
4,,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,,,,,,,,,,,,,,,,,,
4996,,,,,,,,,,,,,,,,,,
4997,,,,,,,,,,,,,,,,,,
4998,,,,,,,,,,,,,,,,,,
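
Because boolean-frame indexing blanks out every non-null cell, the table above cannot show how many values are actually missing. A per-column count is usually easier to read. The following is a minimal sketch on a hypothetical toy frame (`toy` is an illustrative stand-in for `train`, not the real data); the same two lines would apply to `test`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train.csv
toy = pd.DataFrame({
    "Bed Grade": [2.0, np.nan, 3.0, 4.0],
    "City_Code_Patient": [7.0, 7.0, np.nan, np.nan],
    "Age": ["51-60", "71-80", "51-60", "31-40"],
})

# Count missing values per column, keeping only columns that have any
null_counts = toy.isnull().sum()
null_counts = null_counts[null_counts > 0]
print(null_counts)
```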


### Task 4: Conquering the Enigma of Missing Values.
With a sense of unwavering dedication, the data scientist shifted their focus to the "test.csv" dataset, ready to face another formidable challenge. The code snippet "null_values_test" unveiled a fresh perspective on the missing values within this new dataset. Each null value represented a potential hurdle, an element of the story that was yet to be fully understood.

In [4]:
# --- WRITE YOUR CODE FOR TASK 4 ---
# As with the training data, this mask-based indexing shows NaN in every cell
null_values_test = test[test.isnull()]

#--- Inspect data ---
null_values_test

Unnamed: 0,case_id,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Department,Ward_Type,Ward_Facility_Code,Bed Grade,patientid,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit
0,,,,,,,,,,,,,,,,,
1,,,,,,,,,,,,,,,,,
2,,,,,,,,,,,,,,,,,
3,,,,,,,,,,,,,,,,,
4,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,,,,,,,,,,,,,,,,,
1996,,,,,,,,,,,,,,,,,
1997,,,,,,,,,,,,,,,,,
1998,,,,,,,,,,,,,,,,,


### Task 5: Data Healing.
With each line of code executed, the data scientist meticulously filled in the missing pieces of the puzzle. The "Bed Grade" column, which held information critical to the project, saw its gaps mended. By replacing the NA values with the mode of the column, the data scientist ensured that this aspect of the data was now complete and ready for analysis.

The same diligent approach was applied to the "City_Code_Patient" column, both in the training and test datasets. With missing values replaced by the mode, the data scientist breathed life back into these columns, making the data more robust and insightful. The journey to harness the full potential of the datasets continued, as these enhancements brought the data scientist closer to the moment of revelation.

In [5]:
#--- WRITE YOUR CODE FOR TASK 5 ---
train['Bed Grade'] = train['Bed Grade'].fillna(train['Bed Grade'].mode()[0])
train['City_Code_Patient'] = train['City_Code_Patient'].fillna(train['City_Code_Patient'].mode()[0])


#--- Inspect data ---
print(train['Bed Grade'].isna().sum())
print(train['City_Code_Patient'].isna().sum())

0
0


### Task 6: Data Completeness.
With precision and dedication, the data scientist extended their data preparation process to the "test.csv" dataset. Just as they had done for the training data, the missing values in the "Bed Grade" and "City_Code_Patient" columns were diligently addressed.

By filling in these gaps with the mode of their respective columns, the data scientist ensured that the test data was now equally robust and well-prepared. The two datasets were now harmonized, ready to undergo analysis, and the stage was set for the critical next steps in the project.

In [6]:
#--- WRITE YOUR CODE FOR TASK 6 ---
test['Bed Grade'] = test['Bed Grade'].fillna(test['Bed Grade'].mode()[0])
test['City_Code_Patient'] = test['City_Code_Patient'].fillna(test['City_Code_Patient'].mode()[0])


#--- Inspect data ---
print(test['Bed Grade'].isna().sum())
test['City_Code_Patient'].isna().sum()

0


0
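
The notebook fills each dataset with its own mode. One common alternative, sketched here under the assumption that the test data should be treated as unseen, is to compute the fill value on the training data only and reuse it for both frames. The helper name `fill_with_train_mode` and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd

def fill_with_train_mode(train_df, test_df, columns):
    """Fill NaNs in both frames using the mode computed on train_df only."""
    train_df, test_df = train_df.copy(), test_df.copy()
    for col in columns:
        fill_value = train_df[col].mode()[0]  # mode() ignores NaN by default
        train_df[col] = train_df[col].fillna(fill_value)
        test_df[col] = test_df[col].fillna(fill_value)
    return train_df, test_df

# Hypothetical toy data
toy_train = pd.DataFrame({"Bed Grade": [2.0, 2.0, np.nan, 3.0]})
toy_test = pd.DataFrame({"Bed Grade": [4.0, np.nan, 4.0]})
toy_train, toy_test = fill_with_train_mode(toy_train, toy_test, ["Bed Grade"])
print(toy_test["Bed Grade"].tolist())
```

In this toy case the gap in the test frame is filled with 2.0 (the training mode) rather than 4.0 (the test frame's own mode).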

## Module 2
### Task 1: Transforming 'Stay' with LabelEncoder.
In the data scientist's relentless pursuit of knowledge, they ventured into the realm of feature transformation. The code snippet employing the LabelEncoder was a pivotal step in the journey. The "Stay" column, a critical factor in the dataset, was now transformed into numerical values for the machine to comprehend.

This transformation was essential to enable predictive modeling and machine learning algorithms to make sense of the data. The data scientist watched as the "Stay" column, once a collection of varied labels, became a structured numerical representation, opening the door to new possibilities in analysis.

In [7]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
#--- WRITE YOUR CODE FOR TASK 1 ---
le.fit(train['Stay'])

# Transform the data using the LabelEncoder object
train['Stay'] = le.transform(train['Stay'])
#--- Inspect data ---
train['Stay']

0       0
1       4
2       3
3       4
4       4
       ..
4995    0
4996    3
4997    5
4998    2
4999    1
Name: Stay, Length: 5000, dtype: int64
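
To see which integer stands for which stay range, the fitted encoder's `classes_` attribute can be inspected. The sketch below uses a hypothetical handful of labels, so the codes differ from those in the notebook, where more classes are present.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical subset of Stay labels
stay_labels = ["0-10", "41-50", "31-40", "41-50", "More than 100 Days", "11-20"]

stay_encoder = LabelEncoder()
codes = stay_encoder.fit_transform(stay_labels)

# classes_ holds the sorted unique labels; position i is the label for code i
mapping = dict(enumerate(stay_encoder.classes_))
print(mapping)

# inverse_transform turns codes back into the original labels
print(stay_encoder.inverse_transform(codes[:2]))
```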

### Task 2: Charting the Unknown.
In the next chapter of the data scientist's story, they turned their attention to the "test.csv" dataset. Here, they encountered an intriguing challenge. The "Stay" column, vital for the project, was absent from the test data altogether. Without missing a beat, the data scientist added it with a default value of -1, setting the stage for further data manipulation and analysis.

The reasons for this move lay hidden within the algorithmic realm, a placeholder to be eventually replaced by machine-generated predictions. As the data scientist executed this change, it symbolized the start of a new phase in their mission – the one where the algorithm would craft predictions and, in turn, determine the lengths of patients' stays.

In [8]:
#--- WRITE YOUR CODE FOR TASK 2 ---
test['Stay'] = -1
test


Unnamed: 0,case_id,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Department,Ward_Type,Ward_Facility_Code,Bed Grade,patientid,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit,Stay
0,318439,21,c,3,Z,3,gynecology,S,A,2.0,17006,2.0,Emergency,Moderate,2,71-80,3095.0,-1
1,318440,29,a,4,X,2,gynecology,S,F,2.0,17006,2.0,Trauma,Moderate,4,71-80,4018.0,-1
2,318441,26,b,2,Y,3,gynecology,Q,D,4.0,17006,2.0,Emergency,Moderate,3,71-80,4492.0,-1
3,318442,6,a,6,X,3,gynecology,Q,F,2.0,17006,2.0,Trauma,Moderate,3,71-80,4173.0,-1
4,318443,28,b,11,X,2,gynecology,R,F,2.0,17006,2.0,Trauma,Moderate,4,71-80,4161.0,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,320434,26,b,2,Y,1,anesthesia,R,D,2.0,19303,8.0,Trauma,Extreme,4,51-60,8829.0,-1
1996,320435,26,b,2,Y,4,gynecology,R,D,1.0,19303,8.0,Trauma,Extreme,6,51-60,3507.0,-1
1997,320436,23,a,6,X,3,gynecology,Q,F,1.0,19303,8.0,Trauma,Extreme,3,51-60,4109.0,-1
1998,320437,25,e,1,X,4,gynecology,Q,E,3.0,19303,8.0,Emergency,Extreme,4,51-60,4155.0,-1


### Task 3: Data Convergence.
As the data scientist continued their quest for insights, they faced a pivotal moment in their journey. The merging of the "train" and "test" datasets marked a significant milestone. By creating a single, consolidated dataset named "df," they brought together the entirety of their data resources.

In [9]:
#--- WRITE YOUR CODE FOR TASK 3 ---
df = pd.concat([train,test], ignore_index = True)
#--- Inspect data ---
print(len(train))
print(len(test))
df

5000
2000


Unnamed: 0,case_id,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Department,Ward_Type,Ward_Facility_Code,Bed Grade,patientid,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit,Stay
0,1,8,c,3,Z,3,radiotherapy,R,F,2.0,31397,7.0,Emergency,Extreme,2,51-60,4911.0,0
1,2,2,c,5,Z,2,radiotherapy,S,F,2.0,31397,7.0,Trauma,Extreme,2,51-60,5954.0,4
2,3,10,e,1,X,2,anesthesia,S,E,2.0,31397,7.0,Trauma,Extreme,2,51-60,4745.0,3
3,4,26,b,2,Y,2,radiotherapy,R,D,2.0,31397,7.0,Trauma,Extreme,2,51-60,7272.0,4
4,5,26,b,2,Y,2,radiotherapy,S,D,2.0,31397,7.0,Trauma,Extreme,2,51-60,5558.0,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6995,320434,26,b,2,Y,1,anesthesia,R,D,2.0,19303,8.0,Trauma,Extreme,4,51-60,8829.0,-1
6996,320435,26,b,2,Y,4,gynecology,R,D,1.0,19303,8.0,Trauma,Extreme,6,51-60,3507.0,-1
6997,320436,23,a,6,X,3,gynecology,Q,F,1.0,19303,8.0,Trauma,Extreme,3,51-60,4109.0,-1
6998,320437,25,e,1,X,4,gynecology,Q,E,3.0,19303,8.0,Emergency,Extreme,4,51-60,4155.0,-1


### Task 4: Transforming Categories into Numbers.
With each loop iteration, the data scientist's journey into data transformation continued to unfold. They recognized the importance of encoding categorical variables, as it was a pivotal step in preparing the data for machine learning algorithms. For every feature in the list, including "Hospital_type_code," "Department," "Age," and more, the LabelEncoder was brought into play.

This transformation turned the categorical variables into numerical representations, providing a common language for the algorithms to interpret. It was as if these once-diverse variables were now speaking a unified numerical dialect.

In [10]:
#--- WRITE YOUR CODE FOR TASK 4 ---
categorical_features = ['Hospital_type_code', 'Hospital_region_code', 'Department', 'Ward_Type', 'Ward_Facility_Code', 'Type of Admission', 'Severity of Illness','Age']

# Initialize LabelEncoder
le = LabelEncoder()
print(df[['Hospital_type_code', 'Department', 'Age']])
# Transform each categorical feature
for feature in categorical_features:
    df[feature] = le.fit_transform(df[feature])

#--- Inspect data ---
df[['Hospital_type_code', 'Hospital_region_code', 'Department', 'Ward_Type', 'Ward_Facility_Code', 'Type of Admission', 'Severity of Illness','Age']]

     Hospital_type_code    Department    Age
0                     c  radiotherapy  51-60
1                     c  radiotherapy  51-60
2                     e    anesthesia  51-60
3                     b  radiotherapy  51-60
4                     b  radiotherapy  51-60
...                 ...           ...    ...
6995                  b    anesthesia  51-60
6996                  b    gynecology  51-60
6997                  a    gynecology  51-60
6998                  e    gynecology  51-60
6999                  e    gynecology  51-60

[7000 rows x 3 columns]


Unnamed: 0,Hospital_type_code,Hospital_region_code,Department,Ward_Type,Ward_Facility_Code,Type of Admission,Severity of Illness,Age
0,2,2,3,2,5,0,0,5
1,2,2,3,3,5,1,0,5
2,4,0,1,3,4,1,0,5
3,1,1,3,2,3,1,0,5
4,1,1,3,3,3,1,0,5
...,...,...,...,...,...,...,...,...
6995,1,1,1,2,3,1,0,5
6996,1,1,2,2,3,1,0,5
6997,0,0,2,1,5,1,0,5
6998,4,0,2,1,4,0,0,5
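
Reusing a single `LabelEncoder` in the loop means that, once the loop ends, the object only remembers the last column it was fitted on. If decoding is needed later, one option is to keep one encoder per column. This is a sketch on hypothetical toy data; the `encoders` dictionary is an illustrative name, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame with two categorical columns
toy = pd.DataFrame({
    "Department": ["radiotherapy", "anesthesia", "gynecology", "radiotherapy"],
    "Age": ["51-60", "71-80", "51-60", "31-40"],
})

encoders = {}
for feature in ["Department", "Age"]:
    encoders[feature] = LabelEncoder()
    toy[feature] = encoders[feature].fit_transform(toy[feature])

# Each column can still be decoded later with its own encoder
decoded_age = encoders["Age"].inverse_transform(toy["Age"])
print(toy)
print(list(decoded_age))
```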


## Module 3
### Task 1: Data Segmentation.
In the ever-evolving data scientist's tale, a pivotal moment arrived. A new "train" dataset was carved out of the combined "df" from earlier. With the code snippet, the data scientist carefully filtered the data, keeping only the rows where "Stay" did not equal -1. This decision marked the separation of data meant for training and data that would be used for predictions.

The "train" dataset now held only the rows where the outcomes were known, and it was this dataset that would be the cornerstone for developing and validating predictive models. It was a focused, purpose-driven dataset where patterns and insights would be uncovered to guide the project.

In [11]:
#--- WRITE YOUR CODE FOR TASK 1 ---
train = df[df['Stay'] != -1]

#--- Inspect data ---
train

Unnamed: 0,case_id,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Department,Ward_Type,Ward_Facility_Code,Bed Grade,patientid,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit,Stay
0,1,8,2,3,2,3,3,2,5,2.0,31397,7.0,0,0,2,5,4911.0,0
1,2,2,2,5,2,2,3,3,5,2.0,31397,7.0,1,0,2,5,5954.0,4
2,3,10,4,1,0,2,1,3,4,2.0,31397,7.0,1,0,2,5,4745.0,3
3,4,26,1,2,1,2,3,2,3,2.0,31397,7.0,1,0,2,5,7272.0,4
4,5,26,1,2,1,2,3,3,3,2.0,31397,7.0,1,0,2,5,5558.0,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,4996,19,0,7,1,2,2,3,2,2.0,104970,8.0,0,1,2,1,4894.0,0
4996,4997,26,1,2,1,2,2,1,3,4.0,104970,8.0,1,1,2,1,6987.0,3
4997,4998,32,5,9,1,3,2,3,1,2.0,68447,5.0,0,2,4,4,4196.0,5
4998,4999,26,1,2,1,3,2,2,3,2.0,68447,5.0,1,2,3,4,4560.0,2


### Task 2: Preparation for Prediction.
In the data scientist's ongoing quest for knowledge, another crucial step unfolded. The "test" dataset, born from the depths of the "df" dataset, was carved out with precision. The code snippet effortlessly separated data where "Stay" values equaled -1, signifying the portion of the dataset meant for predictions.

This division marked the birth of the "test" dataset, where the outcome was still shrouded in mystery and yet to be revealed. It was this dataset that would soon become the canvas for the predictive models, where the machine would strive to unlock the riddles of patient stay durations.

In [12]:
#--- WRITE YOUR CODE FOR TASK 2 ---
test = df[df['Stay']==-1]

#--- Inspect data ---
test

Unnamed: 0,case_id,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Department,Ward_Type,Ward_Facility_Code,Bed Grade,patientid,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit,Stay
5000,318439,21,2,3,2,3,2,3,0,2.0,17006,2.0,0,2,2,7,3095.0,-1
5001,318440,29,0,4,0,2,2,3,5,2.0,17006,2.0,1,2,4,7,4018.0,-1
5002,318441,26,1,2,1,3,2,1,3,4.0,17006,2.0,0,2,3,7,4492.0,-1
5003,318442,6,0,6,0,3,2,1,5,2.0,17006,2.0,1,2,3,7,4173.0,-1
5004,318443,28,1,11,0,2,2,2,5,2.0,17006,2.0,1,2,4,7,4161.0,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6995,320434,26,1,2,1,1,1,2,3,2.0,19303,8.0,1,0,4,5,8829.0,-1
6996,320435,26,1,2,1,4,2,2,3,1.0,19303,8.0,1,0,6,5,3507.0,-1
6997,320436,23,0,6,0,3,2,1,5,1.0,19303,8.0,1,0,3,5,4109.0,-1
6998,320437,25,4,1,0,4,2,1,4,3.0,19303,8.0,0,0,4,5,4155.0,-1
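
The combine-then-split round trip can be sketched on a small scale as follows. The toy frames are hypothetical, and the `copy()` and `reset_index(drop=True)` calls are optional additions (not in the notebook) that yield independent frames with a 0-based index instead of one starting at the offset seen above.

```python
import pandas as pd

# Hypothetical toy data
toy_train = pd.DataFrame({"case_id": [1, 2, 3], "Stay": [0, 4, 3]})
toy_test = pd.DataFrame({"case_id": [10, 11]})
toy_test["Stay"] = -1  # sentinel marking rows without a known label

combined = pd.concat([toy_train, toy_test], ignore_index=True)

# Split on the sentinel; copy() and reset_index() give independent, 0-based frames
train_part = combined[combined["Stay"] != -1].copy()
test_part = combined[combined["Stay"] == -1].copy().reset_index(drop=True)
print(test_part)
```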


### Task 3: Feature Engineering for Enhanced Predictive Analysis.
As the data scientist's journey unfolded, a new chapter of feature engineering and data enrichment began. The code snippet provided a function, "get_countid_enocde," designed to extract valuable information from the dataset.

Using this function, the data scientist grouped data by different combinations of features such as 'patientid,' 'Hospital_region_code,' and 'Ward_Facility_Code' to calculate counts and then merged these counts back into the datasets for both training and testing data. This process created new features, such as 'count_id_patient' and 'count_id_patient_hospitalCode,' that reflected the frequency of each combination, offering deeper insights into the data.

With each transformation, the data scientist was enhancing the dataset, preparing it for the predictive modeling phase. The result was a dataset, "test1," meticulously refined, and now poised for predictive analysis, with features engineered to uncover hidden patterns in patient stays.

In [13]:
import numpy as np
def get_countid_enocde(train, test, cols, name):
  temp = train.groupby(cols)['case_id'].count().reset_index().rename(columns = {'case_id': name})
  temp2 = test.groupby(cols)['case_id'].count().reset_index().rename(columns = {'case_id': name})
  train = pd.merge(train, temp, how='left', on= cols)
  test = pd.merge(test,temp2, how='left', on= cols)
  train[name] = train[name].astype('float')
  test[name] = test[name].astype('float')
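  # Assign the result back instead of using inplace=True on a column selection,
  # which newer pandas versions warn about and may not apply to the frame
  train[name] = train[name].fillna(np.median(temp[name]))
  test[name] = test[name].fillna(np.median(temp2[name]))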
  return train, test



# Uncomment the code below when running this task
train, test = get_countid_enocde(train, test, ['patientid'], name = 'count_id_patient')
train, test = get_countid_enocde(train, test,
                                 ['patientid', 'Hospital_region_code'], name = 'count_id_patient_hospitalCode')
train, test = get_countid_enocde(train, test,
                                 ['patientid', 'Ward_Facility_Code'], name = 'count_id_patient_wardfacilityCode')
#--- WRITE YOUR CODE FOR TASK 3 ---
test1 = test.drop(['Stay', 'patientid', 'Hospital_region_code','Ward_Facility_Code'], axis =1)

#--- Inspect data ---
test1

Unnamed: 0,case_id,Hospital_code,Hospital_type_code,City_Code_Hospital,Available Extra Rooms in Hospital,Department,Ward_Type,Bed Grade,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit,count_id_patient,count_id_patient_hospitalCode,count_id_patient_wardfacilityCode
0,318439,21,2,3,3,2,3,2.0,2.0,0,2,2,7,3095.0,7.0,1.0,1.0
1,318440,29,0,4,2,2,3,2.0,2.0,1,2,4,7,4018.0,7.0,4.0,4.0
2,318441,26,1,2,3,2,1,4.0,2.0,0,2,3,7,4492.0,7.0,2.0,2.0
3,318442,6,0,6,3,2,1,2.0,2.0,1,2,3,7,4173.0,7.0,4.0,4.0
4,318443,28,1,11,2,2,2,2.0,2.0,1,2,4,7,4161.0,7.0,4.0,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,320434,26,1,2,1,1,2,2.0,8.0,1,0,4,5,8829.0,5.0,2.0,2.0
1996,320435,26,1,2,4,2,2,1.0,8.0,1,0,6,5,3507.0,5.0,2.0,2.0
1997,320436,23,0,6,3,2,1,1.0,8.0,1,0,3,5,4109.0,5.0,3.0,1.0
1998,320437,25,4,1,4,2,1,3.0,8.0,0,0,4,5,4155.0,5.0,3.0,2.0
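
The count features above amount to "how many rows share this key". As an illustration of the same idea, and not the notebook's own implementation, `groupby(...).transform("count")` can produce such counts without a merge. The toy frame is hypothetical, and the resulting columns are integers rather than the floats the notebook produces.

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({
    "case_id": [1, 2, 3, 4, 5],
    "patientid": [100, 100, 100, 200, 200],
    "Hospital_region_code": [2, 2, 0, 1, 1],
})

# Number of rows sharing the same patientid
toy["count_id_patient"] = toy.groupby("patientid")["case_id"].transform("count")

# Number of rows sharing the same (patientid, Hospital_region_code) pair
toy["count_id_patient_hospitalCode"] = (
    toy.groupby(["patientid", "Hospital_region_code"])["case_id"].transform("count")
)
print(toy)
```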


### Task 4: Sculpting the Data.
The "train1" dataset emerged as the refined version of the training data. With precision, the data scientist executed the code to eliminate certain columns, including 'case_id,' 'patientid,' 'Hospital_region_code,' and 'Ward_Facility_Code.' These exclusions were made to streamline the dataset for predictive modeling.

By removing these variables, the data scientist aimed to reduce noise and focus on the most relevant features for predicting patient stays. The "train1" dataset now stood as a lean, purpose-built platform for the upcoming modeling stage, where machine learning algorithms would uncover the intricate patterns within the data, taking the project one step closer to its conclusion.

In [14]:
#--- WRITE YOUR CODE FOR TASK 4 ---
columns_to_drop = ['case_id', 'patientid', 'Hospital_region_code', 'Ward_Facility_Code']

# Drop these columns from the DataFrame
train1 = train.drop(columns=columns_to_drop)
#--- Inspect data ---
print(train1.columns)
train1

Index(['Hospital_code', 'Hospital_type_code', 'City_Code_Hospital',
       'Available Extra Rooms in Hospital', 'Department', 'Ward_Type',
       'Bed Grade', 'City_Code_Patient', 'Type of Admission',
       'Severity of Illness', 'Visitors with Patient', 'Age',
       'Admission_Deposit', 'Stay', 'count_id_patient',
       'count_id_patient_hospitalCode', 'count_id_patient_wardfacilityCode'],
      dtype='object')


Unnamed: 0,Hospital_code,Hospital_type_code,City_Code_Hospital,Available Extra Rooms in Hospital,Department,Ward_Type,Bed Grade,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit,Stay,count_id_patient,count_id_patient_hospitalCode,count_id_patient_wardfacilityCode
0,8,2,3,3,3,2,2.0,7.0,0,0,2,5,4911.0,0,14.0,4.0,5.0
1,2,2,5,2,3,3,2.0,7.0,1,0,2,5,5954.0,4,14.0,4.0,5.0
2,10,4,1,2,1,3,2.0,7.0,1,0,2,5,4745.0,3,14.0,4.0,2.0
3,26,1,2,2,3,2,2.0,7.0,1,0,2,5,7272.0,4,14.0,6.0,3.0
4,26,1,2,2,3,3,2.0,7.0,1,0,2,5,5558.0,4,14.0,6.0,3.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,19,0,7,2,2,3,2.0,8.0,0,1,2,1,4894.0,0,3.0,3.0,2.0
4996,26,1,2,2,2,1,4.0,8.0,1,1,2,1,6987.0,3,3.0,3.0,1.0
4997,32,5,9,3,2,3,2.0,5.0,0,2,4,4,4196.0,5,3.0,2.0,1.0
4998,26,1,2,3,2,2,2.0,5.0,1,2,3,4,4560.0,2,3.0,2.0,1.0


## Module 4
### Task 1: Data Splitting for Model Mastery.
With a sense of purpose, the data scientist transitioned to the realm of model preparation and evaluation. The code snippet utilizing "train_test_split" played a pivotal role in splitting the "train1" dataset into two distinct subsets - one for training and the other for testing.

The features, designated as 'X1,' and the target variable, 'y1' (in this case, 'Stay'), were segregated. The split into 'X_train' and 'X_test' along with their corresponding 'y_train' and 'y_test' counterparts was carried out to create a controlled environment for model development and evaluation.

This division of data was a fundamental step in the process, essential for training and assessing the performance of machine learning models. With the training and testing datasets now established, the data scientist was poised to embark on the final phases of their data-driven journey, where predictive modeling would unveil insights, guide decision-making, and bring the project to its ultimate conclusion.

In [15]:
from sklearn.model_selection import train_test_split

# --- WRITE YOUR CODE FOR TASK 1 ---
X1 = train1.drop('Stay', axis=1)  # Features
y1 = train1['Stay']
X_train, X_test, y_train, y_test = train_test_split(X1,y1, test_size=0.2, random_state=42)

#--- Inspect data ---
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)


X_train shape: (4000, 16)
X_test shape: (1000, 16)
y_train shape: (4000,)
y_test shape: (1000,)
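
When the target classes are unevenly represented, a plain random split can leave some classes thin in one subset. The `stratify` argument of `train_test_split` keeps class proportions similar on both sides. The notebook does not use it; the sketch below shows the option on hypothetical toy data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 rows of class 0, 20 rows of class 1
X_toy = pd.DataFrame({"feature": range(100)})
y_toy = pd.Series([0] * 80 + [1] * 20, name="Stay")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratify, both subsets keep the 80/20 class ratio
print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```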


### Task 2: The XGBoost Model Story.
In the final stages of the data scientist's journey, the spotlight shone on predictive modeling. The XGBoost algorithm, a powerful machine learning tool, was enlisted to unlock the hidden patterns within the data. With a meticulously crafted configuration of hyperparameters, including 'max_depth,' 'learning_rate,' and 'n_estimators,' the XGBoost classifier was primed for action.

With precision, the classifier was trained on the "X_train" and "y_train" datasets, learning from the patterns embedded within the training data. Once the model was honed, it embarked on the critical phase of prediction. 'X_test' became the testing ground where the model's capabilities were put to the test.

The accuracy score was the final judgment, quantifying the model's performance. With an accuracy score rounded to two decimal places, the data scientist had a clear measure of the model's ability to predict patient stays, bringing the project to its ultimate conclusion.


In [16]:
import xgboost
from sklearn.metrics import accuracy_score


classifier_xgb = xgboost.XGBClassifier(max_depth=4, learning_rate=0.1, n_estimators=800,
                                  objective='multi:softmax', reg_alpha=0.5, reg_lambda=1.5,
                                  booster='gbtree', n_jobs=4, min_child_weight=2, base_score= 0.75)

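# --- WRITE YOUR CODE FOR TASK 2 ---
# fit() returns the fitted estimator itself, so its result is not an accuracy score
classifier_xgb.fit(X_train, y_train)
y_pred = classifier_xgb.predict(X_test)


#--- Inspect data ---
comparison_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print(comparison_df)
# Accuracy on the held-out split, rounded to two decimal places
acc_score_xgb = round(accuracy_score(y_test, y_pred), 2)
acc_score_xgb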

ModuleNotFoundError: No module named 'xgboost'
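
The cell above stopped at the import because the `xgboost` package was not installed in this environment, so no accuracy figure was recorded. Once a model has produced predictions, accuracy and a per-class breakdown could be inspected along the following lines. The `actual` and `predicted` series are hypothetical stand-ins for `y_test` and `y_pred`, not results from the notebook.

```python
import pandas as pd

# Hypothetical encoded labels standing in for y_test and y_pred
actual = pd.Series([0, 1, 2, 2, 1, 0, 2, 1], name="Actual")
predicted = pd.Series([0, 2, 2, 2, 1, 0, 1, 1], name="Predicted")

# Accuracy is the share of positions where the prediction equals the true label
accuracy = round((actual == predicted).mean(), 2)

# Rows are true classes, columns are predicted classes
confusion = pd.crosstab(actual, predicted)
print(accuracy)
print(confusion)
```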

### Task 3: The Final Predictive Model Transformation.
In the final chapter of this data science journey, the predictive model's capabilities were put into action for real-world predictions. Using the trained XGBoost classifier, the data scientist made predictions on the "test1" dataset. These predictions were based on the features in "test1" and stored as "pred_xgb."

The data scientist's meticulous work continued as they assembled the results. A DataFrame, "result_xgb," was crafted, capturing the case ID of each prediction alongside the corresponding "Stay" values.

To enhance the readability and understanding of the results, the data scientist mapped the numeric labels of "Stay" to their corresponding meaningful categories. The predictions now represented patient stays in terms of the actual durations, making them more accessible and valuable for practical use.

In [None]:
# Work on test1 under a short alias; it already holds case_id plus the model's features
temp = test1
temp

In [None]:
# --- WRITE YOUR CODE FOR TASK 3 ---

# Use the provided label mapping
label_mapping = {
    0: '0-10', 1: '11-20', 2: '21-30', 3: '31-40', 4: '41-50',
    5: '51-60', 6: '61-70', 7: '71-80', 8: '81-90', 9: '91-100',
    10: 'More than 100 Days'
}

# Only case_id is dropped before predicting, so the columns match the training features

pred_xgb = classifier_xgb.predict(temp.drop(['case_id'], axis = 1))

result_xgb = pd.DataFrame({
    'case_id': temp['case_id'],
    'Stay': pred_xgb
})

result_xgb['Stay'] = result_xgb['Stay'].map(label_mapping)


#--- Inspect data ---
result_xgb
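
Prediction only makes sense if the frame passed to `predict` has the same feature columns, in the same order, as the frame used for training. A small guard such as the one below can make that explicit. The helper `align_features` and the toy data are hypothetical; in the notebook the expected column list would presumably come from `X_train.columns`.

```python
import pandas as pd

def align_features(frame, expected_columns):
    """Return frame restricted to expected_columns, in that order.

    Raises ValueError if any expected column is missing.
    """
    missing = [col for col in expected_columns if col not in frame.columns]
    if missing:
        raise ValueError(f"Missing feature columns: {missing}")
    return frame[list(expected_columns)]

# Hypothetical stand-ins for X_train.columns and test1
expected = ["Hospital_code", "Age", "count_id_patient"]
toy_test1 = pd.DataFrame({
    "case_id": [318439, 318440],
    "count_id_patient": [7.0, 7.0],
    "Age": [7, 7],
    "Hospital_code": [21, 29],
})
features = align_features(toy_test1, expected)
print(features.columns.tolist())
```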

### Task 4: Decoding Patient Stays.
In the closing chapters of this data-driven saga, the data scientist's journey took a significant turn. With the "result_xgb" DataFrame, they embarked on an insightful exploration. The code snippet, with "groupby" and "nunique" operations, provided a bird's-eye view of the distribution of predicted patient stays.

Each category, ranging from '0-10' to 'More than 100 Days,' was meticulously examined. The "case_id" count for each group was tallied, offering a clear understanding of the predicted distribution of patient stays.

This information was not just a collection of numbers; it was a crucial piece of the puzzle. It provided a snapshot of how the predictive model envisioned the durations of patient stays, which could be pivotal for making informed decisions in a healthcare setting.

In [None]:
# --- WRITE YOUR CODE FOR TASK 4 ---
result = result_xgb.groupby('Stay')['case_id'].nunique().reset_index()

#--- Inspect data ---
# Rename the columns for clarity
result.columns = ['Stay', 'Count']

# Sort the result by the 'Stay' column for better visualization
result = result.sort_values('Stay')

# Display the result
print(result)
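
Sorting the `Stay` strings works here because their alphabetical order happens to coincide with the natural order of the ranges, and categories with no predictions simply do not appear. If a fixed order with explicit zero counts is wanted, reindexing against the full label list is one option. This is a sketch on hypothetical predictions; `stay_order` repeats the labels from the mapping used earlier.

```python
import pandas as pd

stay_order = ["0-10", "11-20", "21-30", "31-40", "41-50", "51-60",
              "61-70", "71-80", "81-90", "91-100", "More than 100 Days"]

# Hypothetical predictions standing in for result_xgb
toy_result = pd.DataFrame({
    "case_id": [1, 2, 3, 4, 5],
    "Stay": ["21-30", "0-10", "21-30", "More than 100 Days", "11-20"],
})

counts = (
    toy_result.groupby("Stay")["case_id"].nunique()
    .reindex(stay_order, fill_value=0)  # fixed order, zero for unseen classes
    .rename_axis("Stay")
    .rename("Count")
    .reset_index()
)
print(counts)
```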