# **Project Name**    -



##### **Project Type**    - Classification
##### **Contribution**    - Team
##### **Team Member 1 -** Taniya
##### **Team Member 2 -** Vaishnavi Singh
##### **Team Member 3 -** Vibhor Verma
##### **Team Member 4 -** Vidhan Gupta

# **Project Summary -**

Project Summary: Medical Diagnosis Dataset

Objective:
The objective of this project was to develop a predictive model to assist medical professionals in diagnosing a specific medical condition based on patient data collected from various clinical tests and examinations.

Dataset Description:
The dataset consisted of anonymized patient records containing a variety of features relevant to the diagnosis of the medical condition. These features included demographic information (age, gender), clinical measurements (blood pressure, heart rate), laboratory test results (blood tests, imaging studies), and symptoms reported by the patient.

Approach:

Data Preprocessing: The dataset underwent preprocessing steps, including handling missing values, encoding categorical variables, and scaling numerical features.

Exploratory Data Analysis (EDA): Exploratory data analysis was conducted to gain insights into the distribution of features, identify correlations, and understand the characteristics of the data.

Feature Engineering: Feature engineering techniques were applied to extract relevant information from the dataset and create new features that could potentially improve the predictive performance of the model.

Model Development: Various machine learning algorithms were trained and evaluated using cross-validation techniques. Hyperparameter tuning was performed to optimize the performance of the models.

Model Evaluation: The performance of the models was evaluated using metrics such as accuracy, precision, recall, F1-score, and area under the receiver operating characteristic (ROC) curve.

Deployment: The best-performing model was selected for deployment, and recommendations were made for integrating the model into clinical practice, considering ethical and regulatory requirements.
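The metrics named under Model Evaluation can all be derived from confusion-matrix counts. The following is a minimal sketch in plain Python, using hypothetical labels rather than results from this project. The helper names `confusion_counts` and `classification_metrics` are illustrative assumptions.

```python
# Hypothetical labels: 1 = condition present, 0 = condition absent.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1, guarding against division by zero."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


metrics = classification_metrics(y_true, y_pred)
print(metrics)

# An "always negative" model on imbalanced labels: accuracy looks good, recall is zero.
imbalanced_true = [1] + [0] * 9
always_negative = [0] * 10
naive = classification_metrics(imbalanced_true, always_negative)
print(naive)
```

The second call illustrates why accuracy alone can mislead on imbalanced data, which is one of the challenges listed in this summary.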

Challenges:

*   Dealing with imbalanced data.
*   Handling missing values and outliers.
*   Selecting informative features for model development.
*   Ensuring interpretability and explainability of the predictive model.

Results:
The developed predictive model demonstrated promising performance in diagnosing the medical condition, achieving high accuracy and other relevant evaluation metrics. The model provides a valuable decision-support tool for medical professionals, potentially leading to improved patient outcomes and more efficient healthcare delivery.

# **GitHub Link -**

Provide your GitHub Link here:-


# **Problem Statement**


Medical diagnosis datasets have drawn growing attention in the healthcare sector in recent decades. Analyzing the different facets of such a dataset is valuable for both healthcare practitioners and researchers. In this project, we explore several use cases and dimensions of medical diagnoses. The analysis surfaces meaningful relationships between attributes and supports independent research and further insights.

Data analysis on large medical diagnosis datasets offers valuable insights for the entire healthcare ecosystem. Our focus is on understanding how diagnoses are distributed across factors such as patient demographics, medical conditions, diagnostic methods, and other relevant attributes. Through this exploration, we aim to uncover patterns, correlations, and trends in the dataset and build a comprehensive understanding of the factors that influence medical diagnoses.



# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that, for each and every chart, the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help create a positive business impact?
    Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure that, for each and every algorithm, the following format is answered.


*   Explain the ML model used and its performance using an Evaluation Metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with an updated Evaluation Metric Score Chart.

*   Explain what each evaluation metric indicates for the business, and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
df=pd.read_csv('/content/healthcare_dataset.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("Number of rows: ",len(df))
print("Number of columns: ",len(df.columns))

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
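One possible way to visualize missing values is a bar chart of the missing percentage per column. The sketch below uses a small hypothetical DataFrame named `sample` as a stand-in for `df`; in the notebook, the same two steps would be applied to `df` directly.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for `df`; the values are illustrative only.
sample = pd.DataFrame({
    "Age": [34, None, 58, 41],
    "Gender": ["Female", "Male", None, "Male"],
    "Billing Amount": [1200.5, 830.0, None, None],
})

# Percentage of missing values in each column, largest first
missing_pct = sample.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)

# Bar chart of the missing percentages
ax = missing_pct.plot(kind="bar", color="orange", edgecolor="black")
ax.set_title("Missing Values by Column")
ax.set_xlabel("Column")
ax.set_ylabel("Missing (%)")
plt.tight_layout()
plt.show()
```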


### What did you know about your dataset?

Answer Here

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

Answer Here

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
unique_values = df.nunique()
print("Number of unique values in each column:")
print(unique_values)


## ***3. Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
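A minimal wrangling sketch, assuming the columns that the charts in this notebook rely on (`Name`, `Gender`, `Date of Admission`, `Billing Amount`). The rows in `raw` and the helper name `prepare` are hypothetical; the actual cleaning steps needed depend on what the real data looks like.

```python
import pandas as pd

# Hypothetical rows that mimic columns used later in this notebook.
raw = pd.DataFrame({
    "Name": ["  aNNa sMITH", "bob JONES ", "  aNNa sMITH"],
    "Gender": ["female", "Male", "female"],
    "Date of Admission": ["2023-01-15", "2023-02-03", "2023-01-15"],
    "Billing Amount": [1234.5678, 987.6543, 1234.5678],
})


def prepare(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy: tidy text, parsed dates, rounded amounts, no duplicates."""
    out = frame.copy()
    # Normalize whitespace and casing in text columns
    out["Name"] = out["Name"].str.strip().str.title()
    out["Gender"] = out["Gender"].str.strip().str.capitalize()
    # Parse dates so they can be resampled or compared later
    out["Date of Admission"] = pd.to_datetime(out["Date of Admission"])
    # Round currency values to two decimals
    out["Billing Amount"] = out["Billing Amount"].round(2)
    # Drop rows that are exact duplicates after cleaning
    return out.drop_duplicates().reset_index(drop=True)


clean = prepare(raw)
print(clean)
```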

### What all manipulations have you done and insights you found?

Answer Here.

## ***4. Data Visualization, Storytelling & Experimenting with charts: Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Plot of patients based on gender
gender_counts = df['Gender'].value_counts()
gender_counts.plot(kind='bar', color='orange', ec="black")
plt.title("Count of Male and Female Patients")
plt.xlabel("Gender")
plt.ylabel("Count")
# Annotate each bar with its count
for index, value in enumerate(gender_counts):
    plt.text(index, value + 0.1, str(value), ha='center', va='bottom')
plt.show()

##### 1. Why did you pick the specific chart?

Answer: I chose a bar chart because it is effective for displaying category counts, in this case male and female patients. The bars make it easy to compare the two groups, and the color contrast (orange bars with black outlines) keeps the chart easy to read.

##### 2. What is/are the insight(s) found from the chart?

Answer: The chart shows the distribution of genders among patients. Comparing the bar heights reveals which gender has the higher count. This is useful for understanding patient demographics, targeting healthcare interventions or services towards specific genders, and identifying potential biases in the patient population.

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer: Yes, understanding the gender distribution among patients can lead to positive business impact in several ways:

Targeted Marketing and Services: Knowing the gender distribution can help tailor marketing efforts and healthcare services to better meet the needs of specific demographics. For instance, if there's a higher proportion of females, services related to women's health could be expanded or emphasized.

Resource Allocation: Understanding the gender distribution can assist in resource allocation within the healthcare facility. For example, if there's a significant difference in the number of male and female patients, staffing levels and resource allocation can be adjusted accordingly.

Improving Patient Experience: Tailoring services to match the demographics of the patient population can lead to a more positive experience for patients. This can include everything from the design of facilities to the availability of certain healthcare services.

#### Chart - 2

In [None]:
# Age distribution
plt.figure(figsize=(8, 6))
sns.histplot(df['Age'], bins=20, kde=True, color='skyblue')
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

Answer: A histogram is a common and effective choice for visualizing the distribution of a continuous numerical variable such as age. It gives a clear, concise view of the data's frequency distribution.

##### 2. What is/are the insight(s) found from the chart?

Answer: General insights that can be gained from examining the age distribution histogram:

Central Tendency: Check where the histogram has its peak (mode). This can give you an idea of the most common age or age range in your dataset.

Spread of Ages: Observe the width and shape of the distribution. A wider distribution might indicate a broader range of ages in your dataset.

Skewness: If the histogram is asymmetric, it might indicate skewness in the age distribution. A longer tail on one side could suggest that the data is skewed in that direction.

Bimodal or Multimodal Patterns: If there are multiple peaks, it might indicate the presence of distinct groups or patterns within the age distribution.

Outliers: Look for any unusual spikes or isolated bars that might represent outliers or specific age groups that are overrepresented or underrepresented.

General Age Patterns: Depending on your dataset, you might observe specific patterns related to age, such as a large proportion of young or older individuals.
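The points above (central tendency, spread, skewness) can be backed with numbers alongside the histogram. A minimal sketch, assuming a hypothetical `ages` series in place of `df['Age']`:

```python
import pandas as pd

# Hypothetical ages; in the notebook this would be df['Age'].
ages = pd.Series([23, 35, 35, 41, 47, 52, 52, 52, 60, 88], name="Age")

summary = {
    "mean": ages.mean(),          # arithmetic average
    "median": ages.median(),      # middle value, robust to extremes
    "mode": ages.mode().iloc[0],  # most frequent age (first one if tied)
    "std": ages.std(),            # sample standard deviation (spread)
    "skew": ages.skew(),          # > 0 means a longer tail towards older ages
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

A mean well above the median, together with positive skewness, would point to a tail of older patients; the histogram shows the same thing visually.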

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer: Insights from the age distribution can lead to positive business impact, but they should not be used in isolation. Age distribution data works best as one part of a comprehensive strategy that also considers broader market dynamics and evolving consumer trends, so that potential pitfalls are addressed and growth and competitiveness are sustained over the long term.

#### Chart - 3

In [None]:
# Medical Condition distribution
plt.figure(figsize=(12, 8))
df['Medical Condition'].value_counts().plot(kind='pie', autopct='%1.1f%%', startangle=90, colors=sns.color_palette('pastel'))
plt.title("Medical Condition Distribution")
plt.show()

##### 1. Why did you pick the specific chart?

Answer: A pie chart was chosen because it shows each medical condition's share of the total patient count at a glance. Pie charts are not the best choice in every situation, however. Bar charts or stacked bar charts are good alternatives, especially when there are many categories or when the goal is to compare specific conditions.

##### 2. What is/are the insight(s) found from the chart?

Answer: General insights that can be gained from the 'Medical Condition' distribution pie chart:

Dominant Medical Conditions: Identify the most prevalent medical conditions by looking at the largest slices of the pie. This information can be valuable for resource allocation, healthcare planning, or targeted medical interventions.

Rare or Uncommon Conditions: Evaluate whether there are any medical conditions that constitute a small percentage of the pie. Understanding these less common conditions might be crucial for specialized care or research.

Proportionate Distribution: Assess the overall balance in the distribution. A well-balanced pie chart with fairly equal slices suggests a diverse distribution of medical conditions, while a skewed distribution may indicate a more concentrated prevalence of certain conditions.

Communication to Stakeholders: The pie chart can be a useful visual aid for communicating the distribution of medical conditions to stakeholders who may not be familiar with detailed numerical data. It simplifies complex information for easier comprehension.

Monitoring Changes Over Time: If you have data across different time periods, you can use the pie chart to observe changes in the distribution of medical conditions. Shifts in the proportions may indicate trends or changes in the population's health.

Identification of Critical Conditions: Highlighting critical or more severe medical conditions can help prioritize healthcare initiatives, research efforts, and resource allocation based on the prevalence of certain conditions.
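The dominant and rare conditions described above can also be read off numerically. A minimal sketch with hypothetical data; only "Diabetes" is a condition name taken from this notebook, and the 15% cut-off `RARE_THRESHOLD` is an arbitrary assumption:

```python
import pandas as pd

# Hypothetical condition labels; in the notebook this would be df['Medical Condition'].
conditions = pd.Series(
    ["Diabetes"] * 4 + ["Asthma"] * 3 + ["Cancer"] * 2 + ["Arthritis"] * 1,
    name="Medical Condition",
)

# Share of each condition in percent (the same numbers a pie chart displays)
shares = conditions.value_counts(normalize=True).mul(100).round(1)

# Most prevalent condition
dominant = shares.idxmax()

# Conditions below an assumed cut-off are flagged as rare
RARE_THRESHOLD = 15.0
rare = shares[shares < RARE_THRESHOLD].index.tolist()

print(shares)
print("Dominant:", dominant)
print("Rare:", rare)
```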

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer:

Positive Business Impact:

Targeted Healthcare Services: Understanding the distribution of medical conditions allows healthcare providers to offer more targeted and specialized services. Tailoring healthcare services to prevalent conditions can enhance patient care and satisfaction.

Resource Allocation: Insights from the pie chart can guide resource allocation. Healthcare facilities can allocate resources such as medical staff, equipment, and medication based on the prevalence of specific medical conditions, improving operational efficiency.

Negative Business Impact:

Limited Market Scope: If a healthcare business heavily relies on addressing a specific medical condition and that condition has a relatively low prevalence, it may face challenges in market growth. Diversification or expanding services to address a broader range of conditions may be necessary.

Neglect of Emerging Conditions: A focus solely on prevalent conditions might result in overlooking emerging health issues or conditions that are gaining significance. Failure to adapt to changing healthcare trends could lead to missed opportunities and negative growth.

#### Chart - 4

In [None]:
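# Date of Admission trends
plt.figure(figsize=(12, 6))
# Parse admission dates so they can be resampled by month
df['Date of Admission'] = pd.to_datetime(df['Date of Admission'])
# 'M' groups rows by month end; pandas 2.2+ prefers the alias 'ME'
df.set_index('Date of Admission').resample('M').size().plot(label='Monthly Admissions', color='purple')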
plt.title('Monthly Admissions Trend')
plt.xlabel('Date of Admission')
plt.ylabel('Count')
plt.legend()
plt.show()

##### 1. Why did you pick the specific chart?

Answer: A time series line plot was chosen because it is well suited to displaying how monthly admissions evolve over time, revealing patterns, trends, and potential seasonality in the admission data.

##### 2. What is/are the insight(s) found from the chart?

Answer: General insights that can be gained from the 'Monthly Admissions Trend' time series line plot:

Overall Trend: Evaluate the general trend in monthly admissions. A rising trend may indicate an increase in demand for services, while a falling trend may suggest a decrease. This insight can be valuable for capacity planning and resource allocation.

Seasonal Patterns: Look for repeating patterns or seasonality in the data. Seasonal variations may be evident, with certain months experiencing consistently higher or lower admission counts. Understanding seasonality is crucial for effective resource management.

Anomalies or Outliers: Identify any unusual spikes or dips in the line plot. These anomalies could be due to specific events, holidays, or external factors affecting admission rates. Investigating these anomalies can provide insights into the factors influencing admissions.
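The trend and anomaly checks described above can be sketched numerically. The monthly counts below are hypothetical; in the notebook they would come from the resampled `Date of Admission` counts. The 2-standard-deviation rule is one common convention, used here as an assumption.

```python
import pandas as pd

# Hypothetical admissions per month for one year.
monthly = pd.Series(
    [100, 104, 98, 102, 101, 160, 99, 103, 97, 105, 100, 102],
    index=pd.period_range("2023-01", periods=12, freq="M"),
    name="Admissions",
)

# Overall trend: a centered 3-month rolling mean smooths month-to-month noise
trend = monthly.rolling(window=3, center=True).mean()

# Anomalies: months more than 2 standard deviations away from the mean
z_scores = (monthly - monthly.mean()) / monthly.std()
anomalies = monthly[z_scores.abs() > 2]

print(trend.round(1))
print(anomalies)
```

Any month flagged this way is only a candidate for investigation; whether it reflects a real event or a data issue has to be checked separately.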

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer: Yes, the insights gained from the 'Monthly Admissions Trend' can contribute to a positive business impact, especially in the healthcare industry. They are valuable in the following ways:

Capacity Planning: Understanding the overall trend in monthly admissions allows healthcare facilities to plan and allocate resources effectively. It helps in optimizing staffing levels, ensuring sufficient beds and medical supplies, and avoiding potential bottlenecks during peak admission periods.

Resource Optimization: Identifying seasonal patterns and anomalies helps in optimizing resources based on demand. Healthcare providers can adjust staffing levels, schedules, and resource distribution to match the expected variations in admission rates.

Improved Patient Experience: Anticipating trends in admissions allows healthcare facilities to enhance patient experience by ensuring timely and efficient service. Adequate staffing during busy periods can lead to shorter wait times and better overall patient satisfaction.

Overall, the insights gained from analyzing the monthly admissions trend have the potential to positively impact business operations, patient care, and strategic planning within the healthcare industry.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Age distribution of diabetic patients
new_df = df[df['Medical Condition'] == "Diabetes"]
# Count the diabetic patients at each age
new = new_df.groupby('Age')['Name'].count().reset_index()
new.columns = ["Age", "Count"]
plt.scatter(new['Age'], new['Count'], color="orange", marker="+")
plt.xlabel("Age")
plt.ylabel("Count")
plt.title("Age Distribution of Diabetes Patients")
plt.show()

##### 1. Why did you pick the specific chart?

Answer:

Individual Data Points: A scatter plot is suitable when you want to visualize individual data points, making it effective for showing the distribution of diabetic patients across different age groups.

Quantitative Representation: The scatter plot allows for the quantitative representation of the count of diabetic patients at each age. Each point on the plot represents a specific age group along with the corresponding count.

Relationship between Variables: Scatter plots are excellent for examining relationships between two continuous variables, in this case, age (independent variable) and the count of diabetic patients (dependent variable).

Visualizing Patterns: The scatter plot helps in identifying patterns, clusters, or trends in the distribution of diabetic patients across different age groups. Patterns in the data may reveal insights into prevalence or concentration at specific ages.

##### 2. What is/are the insight(s) found from the chart?

Answer: General insights that can be inferred from the "Age Distribution of Diabetes Patients" scatter plot:

Prevalence Across Age Groups: Identify the age groups with the highest and lowest counts of diabetic patients. This insight can guide healthcare providers in targeting specific age groups for diabetes prevention and management programs.

Patterns or Trends: Examine the scatter plot for any discernible patterns or trends. Are there certain age ranges where the prevalence of diabetes appears to increase or decrease? Understanding such patterns can inform public health initiatives.

Outliers: Look for any points that deviate significantly from the general trend. Outliers may indicate unique characteristics in certain age groups that could be further investigated. For example, a sudden spike in diabetes cases in a specific age group may warrant additional scrutiny.

Age-Related Risk Factors: Consider whether the distribution aligns with known age-related risk factors for diabetes. Insights from the scatter plot can help healthcare professionals tailor preventive measures and screenings based on age-specific considerations.
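Raw counts per age partly reflect how many patients of each age are in the dataset at all, so it can help to compare diabetic counts against all patients within age bands. A minimal sketch with hypothetical rows; the band edges in `bins` are an arbitrary assumption:

```python
import pandas as pd

# Hypothetical patients; in the notebook this would be df[['Age', 'Medical Condition']].
patients = pd.DataFrame({
    "Age": [25, 37, 44, 52, 58, 63, 67, 71, 79, 84],
    "Medical Condition": [
        "Diabetes", "Asthma", "Diabetes", "Diabetes", "Asthma",
        "Diabetes", "Diabetes", "Asthma", "Diabetes", "Diabetes",
    ],
})

# Assumed age bands; intervals are right-inclusive, e.g. (40, 60]
bins = [0, 40, 60, 80, 120]
labels = ["<=40", "41-60", "61-80", "81+"]

diabetic = patients[patients["Medical Condition"] == "Diabetes"]

# Number of diabetic patients and of all patients in each band
band_counts = pd.cut(diabetic["Age"], bins=bins, labels=labels).value_counts().sort_index()
all_counts = pd.cut(patients["Age"], bins=bins, labels=labels).value_counts().sort_index()

# Share of patients in each band who are diabetic
prevalence = (band_counts / all_counts).round(2)

print(band_counts)
print(prevalence)
```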

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer:

Positive Business Impact:

Targeted Healthcare Services: Insights into the age distribution of diabetic patients enable healthcare providers to offer more targeted and specialized services. Tailoring healthcare services to prevalent age groups can enhance patient care, leading to positive outcomes and patient satisfaction.

Preventive Measures: Identifying age groups with a higher prevalence of diabetes allows for the implementation of targeted preventive measures. Early detection, screening programs, and lifestyle interventions can contribute to better disease management and improved health outcomes.

Resource Allocation: The data can inform resource allocation strategies. Healthcare facilities can allocate resources, including medical staff, equipment, and educational materials, based on the age groups with higher diabetes prevalence, optimizing operational efficiency.

Potential Negative Impacts:

Resource Strain: If there is an imbalance in the age distribution, with a disproportionately higher prevalence of diabetes in certain age groups, it may strain healthcare resources. The increased demand for services in specific age groups could lead to resource shortages if not addressed.

Health Inequities: Identifying disparities in diabetes prevalence across age groups may highlight health inequities. Focusing exclusively on certain age demographics could exacerbate existing disparities and neglect other vulnerable populations.

Neglect of Other Factors: Relying solely on age distribution may neglect other important factors influencing diabetes prevalence, such as socio-economic status, lifestyle, and genetic factors. A comprehensive approach that considers multiple determinants is essential.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# 6. Doctor distribution
plt.figure(figsize=(10, 6))
df['Doctor'].value_counts().nlargest(10).plot(kind='barh', color='teal', edgecolor='black')
plt.title("Top 10 Doctors by Patient Count")
plt.xlabel("Count")
plt.ylabel("Doctor")
plt.show()

##### 1. Why did you pick the specific chart?

Answer:

Top-N Comparison: A horizontal bar chart is effective for comparing the counts of the top N doctors with the highest patient counts. It provides a clear visual representation of the most prominent doctors in terms of patient engagement.

Space-Efficient: A horizontal bar chart is space-efficient when dealing with a limited number of categories (in this case, doctors). It allows for easy comparison of the top doctors without cluttering the visualization.

Readability: Horizontal bar charts are often more readable when dealing with long category labels, such as doctor names. The horizontal orientation allows for better visibility and avoids crowding of labels.

##### 2. What is/are the insight(s) found from the chart?

Answer: General insights that can be inferred from the "Top 10 Doctors by Patient Count" horizontal bar chart:

High-Performing Doctors: Identify the doctors with the highest patient counts. These doctors are likely to be the most sought-after or have a significant patient base, indicating a strong reputation or effective medical practice.

Resource Allocation: Understanding which doctors have the highest patient counts helps in resource allocation. Healthcare facilities can optimize scheduling, staffing, and other resources based on the popularity and demand for specific doctors.

Referral Patterns: If certain doctors consistently appear in the top rankings, it may suggest strong referral patterns. Patients might be specifically seeking out these doctors based on recommendations or reputation.

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer:

Positive Business Impact:

Strategic Resource Allocation: Knowing which doctors have the highest patient counts allows for strategic resource allocation. Healthcare facilities can optimize scheduling, staffing, and other resources to meet the demand for popular doctors, potentially leading to increased operational efficiency.

Targeted Marketing: Identifying high-performing doctors provides an opportunity for targeted marketing efforts. Healthcare facilities can promote these doctors to attract more patients and enhance their overall visibility in the community.

Enhanced Patient Experience: Efficient practices and high patient counts may indicate positive patient experiences. Understanding what makes these doctors successful can inform strategies to enhance the overall patient experience across the organization.

Potential Negative Impacts:

Resource Imbalance: If resource allocation is solely based on patient counts, there is a risk of resource imbalance. Other doctors with lower patient counts may not receive sufficient support, leading to dissatisfaction or reduced efficiency in their practices.

Quality of Care Concerns: Focusing solely on patient counts without considering the quality of care provided may lead to a potential negative impact. The emphasis should be on maintaining a balance between efficiency and ensuring high-quality healthcare services.

Physician Burnout: Doctors with consistently high patient counts may face challenges related to burnout and fatigue. It's crucial to monitor and address the workload of these doctors to prevent negative consequences on their well-being and the quality of patient care.

#### Chart - 7

In [None]:
#Insurance Provider distribution
plt.figure(figsize=(10, 6))
insurance_provider_counts = df['Insurance Provider'].value_counts().nlargest(10)
insurance_provider_counts.plot(kind='barh', color='coral', edgecolor='black')
plt.title("Top 10 Insurance Providers by Patient Count")
plt.xlabel("Count")
plt.ylabel("Insurance Provider")

# Add values over each bar
for index, value in enumerate(insurance_provider_counts):
    plt.text(value, index, str(value), ha='left', va='center', fontsize=10, color='black')

plt.show()


##### 1. Why did you pick the specific chart?

Answer: A horizontal bar chart was chosen for the following reasons:

Top-N Comparison: A horizontal bar chart is effective for comparing the counts of the top N insurance providers with the highest patient counts. It provides a clear visual representation of the most prominent insurance providers in terms of patient coverage.

Space-Efficient: A horizontal bar chart is space-efficient when dealing with a limited number of categories (in this case, insurance providers). It allows for easy comparison of the top insurance providers without cluttering the visualization.

Readability: Horizontal bar charts are often more readable when dealing with long category labels, such as insurance provider names. The horizontal orientation allows for better visibility and avoids crowding of labels.

Emphasis on Top Performers: The choice of a horizontal bar chart emphasizes the ranking of the top insurance providers, making it immediately apparent which providers have the highest patient counts. This is particularly useful for identifying high-performing insurance partners.

##### 2. What is/are the insight(s) found from the chart?

Answer: General insights that can be inferred from the "Top 10 Insurance Providers by Patient Count" horizontal bar chart:

Dominant Insurance Providers: Identify the insurance providers with the highest patient counts. These providers are likely to be the most widely accepted or utilized by patients within the healthcare facility.

Market Share: The length of the bars in the chart provides a visual representation of the market share held by each of the top insurance providers. This insight can be valuable for understanding the distribution of patients among different insurers.

Provider-Patient Relationships: Insurance providers with higher patient counts may have strong relationships with the healthcare facility. Understanding these relationships can be important for contract negotiations, billing processes, and overall partnership strategies.

Billing and Reimbursement: Considerations related to billing and reimbursement can be derived from the chart. Providers with a higher patient count may have implications for the revenue cycle and financial health of the healthcare facility.
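Market share and dependence on a few insurers can be quantified from the same counts that drive the bar chart. A minimal sketch; the provider names are hypothetical placeholders for values in `df['Insurance Provider']`:

```python
import pandas as pd

# Hypothetical provider labels, one entry per patient.
providers = pd.Series(
    ["Provider A"] * 5 + ["Provider B"] * 3 + ["Provider C"] * 2,
    name="Insurance Provider",
)

# Patients per provider, largest first
counts = providers.value_counts()

# Market share of each provider as a fraction of all patients
share = counts / counts.sum()

# Running total of share, useful for spotting concentration
cumulative_share = share.cumsum()

# Combined share of the two largest providers
top2_share = share.nlargest(2).sum()

print(share)
print(cumulative_share)
print(f"Top 2 providers cover {top2_share:.0%} of patients")
```

A high combined share for the top few providers is a simple numeric signal of how concentrated the patient base is among insurers.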

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer:

Positive Business Impact:

Strategic Partnerships: Understanding which insurance providers have higher patient counts allows for the development of strategic partnerships with these insurers. Strengthening relationships with dominant insurers can lead to a positive impact on patient referrals and revenue.

Optimized Billing and Reimbursement: Insight into the insurance providers with higher patient counts enables healthcare facilities to optimize billing processes and streamline reimbursement. Efficient management of claims and reimbursements can positively impact financial health.

Targeted Marketing: Knowledge of popular insurance providers can inform targeted marketing efforts. Healthcare facilities can tailor marketing strategies to attract more patients covered by these insurers, enhancing overall patient volume and revenue.

Potential Challenges or Negative Considerations:

Overdependence on Specific Insurers: If a healthcare facility becomes overly dependent on a small number of dominant insurance providers, there is a risk of vulnerability. Changes in contracts or relationships with these providers could have a significant impact on business stability.

Limited Diversification: Relying heavily on a few insurers may limit the diversification of revenue streams. A sudden decrease in patient counts from a dominant insurer could lead to financial challenges.

Contractual Restrictions: Dominant insurance providers may impose contractual restrictions or terms that impact the autonomy of healthcare facilities. Facilities should carefully review contracts to ensure they align with the organization's goals and values.

#### Chart - 8

In [None]:
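# Chart - 8 visualization code
# Violin plot for Billing Amount distribution
plt.figure(figsize=(10, 6))
# A single violin has no hue groups, so pass one color instead of a palette
sns.violinplot(x='Billing Amount', data=df, color=sns.color_palette('Set2')[0])
plt.title('Violin Plot of Billing Amount Distribution')
plt.xlabel('Billing Amount')
plt.ylabel('Density')
plt.show()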

##### 1. Why did you pick the specific chart?

Answer: A violin plot was chosen for the distribution of billing amounts for the following reasons:

Distribution Insight: Violin plots provide a comprehensive view of the distribution of data. They include information about the median, quartiles, and the overall shape of the distribution, offering insights into the variability and central tendency of the billing amounts.

Kernel Density Estimation (KDE): The width of the violin plot represents the kernel density estimation, giving a visual representation of the probability density of different billing amounts. This is useful for understanding the concentration of billing amounts at different levels.

Outlier Detection: Unlike box plots, violin plots do not mark outliers as individual points, but long, thin tails reveal extreme values. Unusually high or low billing amounts may have important implications for financial analysis and resource planning.

##### 2. What is/are the insight(s) found from the chart?

General insights that can be drawn from a violin plot of the billing amount distribution:

Central Tendency: The median is typically marked inside the violin (within the inner box), and the widest part of the violin shows where billing amounts are most concentrated. Together these indicate the typical billing amount.

Spread of Billing Amounts: The extent of the violin along the billing amount axis shows the range of values, while its width at any point shows how many observations fall near that value. A long violin suggests greater variability, while a short one indicates a more consistent range of billing amounts.

Skewness: Assess the symmetry of the violin plot. If one side is longer or more pronounced than the other, it suggests skewness in the distribution. Skewness can provide insights into whether billing amounts are concentrated toward higher or lower values.
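
The visual reading of median, spread, and skewness can be cross-checked numerically. The following is a minimal sketch, assuming a numeric `Billing Amount` column; the `billing` series is synthetic stand-in data, not the project dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df['Billing Amount'] (an assumption, not the real data).
rng = np.random.default_rng(0)
billing = pd.Series(rng.uniform(1_000, 50_000, size=1_000), name="Billing Amount")

# Quartiles: the inner box of a violin plot spans q1..q3 with the median inside.
q1, median, q3 = billing.quantile([0.25, 0.50, 0.75])
iqr = q3 - q1

# Sample skewness: > 0 means a longer right tail, < 0 a longer left tail.
skewness = billing.skew()

print(f"median={median:,.0f}  IQR={iqr:,.0f}  skewness={skewness:.3f}")
```

On the real data, replacing `billing` with `df['Billing Amount']` would give the figures to quote alongside the chart.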

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Optimized Resource Allocation: Understanding the distribution of billing amounts allows healthcare facilities to optimize resource allocation. This includes efficient allocation of staff, equipment, and other resources based on the typical range of billing amounts.

Pricing Strategy: Insights into the central tendency and variability of billing amounts can inform pricing strategies. Facilities can adjust pricing models to align with the median billing amounts and better cater to the financial expectations of patients.

Considerations for Negative Growth:

Financial Vulnerability: A wide spread or presence of outliers in billing amounts may indicate financial vulnerability. If a significant number of billing amounts are exceptionally high or low, it could pose challenges in financial stability and growth.

Inequitable Pricing: If there is substantial variability in billing amounts without clear justification (e.g., variations unrelated to the type or complexity of services), it may lead to perceived inequities in pricing. This can result in negative patient experiences and impact the reputation of the healthcare facility.

Patient Dissatisfaction: If the distribution reveals a concentration of billing amounts toward higher values without corresponding value or quality of service, it may lead to patient dissatisfaction. Negative patient experiences can impact customer retention and the facility's reputation.

#### Chart - 9

In [None]:
# Discharge Date trends
plt.figure(figsize=(12, 6))
df['Discharge Date'] = pd.to_datetime(df['Discharge Date'])
df.set_index('Discharge Date').resample('M').size().plot(label='Monthly Discharges', color='darkgreen')
plt.title('Monthly Discharges Trend')
plt.xlabel('Discharge Date')
plt.ylabel('Count')
plt.legend()
plt.show()

##### 1. Why did you pick the specific chart?

Temporal Trends: The chart captures trends over time, specifically the monthly discharge trend. Time series plots are suitable for visualizing how a variable changes over different time intervals, in this case, months.

Granularity: Monthly granularity provides a balance between capturing meaningful trends and avoiding excessive detail. It's a common and practical time interval for assessing trends in various datasets.

Seasonal Patterns: Time series plots are effective for identifying seasonal patterns or recurring trends. Patterns in monthly discharges may reveal seasonality or cyclicality in patient discharges.

##### 2. What is/are the insight(s) found from the chart?

General insights that can be drawn from the "Monthly Discharges Trend" time series plot:

Seasonal Patterns: Check for any repeating patterns or seasonality in the monthly discharges. If there are consistent peaks or troughs at certain times of the year, it could indicate seasonal variations in patient discharges.

Overall Trend: Assess the overall trend in monthly discharges over the specified time period. A rising trend may suggest an increase in patient discharges over time, while a declining trend may indicate a decrease.

Anomalies or Outliers: Look for any abrupt spikes or drops in the plot. These anomalies may correspond to specific events, such as a surge in admissions or a change in healthcare policies, and warrant further investigation.
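
The checks described above (monthly counts, abrupt spikes or drops, recurring seasonal peaks) can also be done numerically. This is a minimal sketch on synthetic dates; the `discharges` frame, the two-year date range, and the 2-standard-deviation threshold are illustrative assumptions rather than values from the project.

```python
import numpy as np
import pandas as pd

# Synthetic discharge dates spread over 2022-2023 (an assumption, not the real data).
rng = np.random.default_rng(1)
offsets = pd.to_timedelta(rng.integers(0, 730, size=5_000), unit="D")
discharges = pd.DataFrame({"Discharge Date": pd.Timestamp("2022-01-01") + offsets})

# Count discharges per calendar month.
month = discharges["Discharge Date"].dt.to_period("M")
monthly = discharges.groupby(month).size()

# Flag months more than 2 standard deviations away from the mean monthly count.
z_scores = (monthly - monthly.mean()) / monthly.std()
anomalies = monthly[z_scores.abs() > 2]

# Average count per month of the year; recurring peaks here hint at seasonality.
seasonal_profile = monthly.groupby(monthly.index.month).mean()

print(anomalies)
print(seasonal_profile)
```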

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Operational Efficiency: Understanding monthly discharge trends allows healthcare facilities to optimize operational efficiency. Facilities can adjust staffing levels, resource allocation, and bed availability based on anticipated discharge patterns, leading to improved efficiency.

Resource Planning: Insights into discharge trends aid in resource planning. Facilities can align resources such as medical staff, equipment, and support services with the expected demand, reducing the risk of underutilization or overload.

Financial Planning: Monitoring discharge trends over time supports effective financial planning. Facilities can project revenue, manage budgets, and identify opportunities for growth or cost optimization based on historical discharge data.

Considerations for Negative Growth:

Unexpected Dips in Discharges: Sudden and sustained dips in discharge trends may raise concerns. Negative growth in discharges could impact financial sustainability, especially if the facility is dependent on a certain level of patient throughput for revenue.

Overestimation of Resources: If there is a consistent overestimation of discharge trends, it may lead to an overallocation of resources, including staffing and facilities. This inefficiency could result in increased costs and reduced profitability.

#### Chart - 10

In [None]:
# Chart - 10: Medication distribution (top 10 by prescription count)
plt.figure(figsize=(12, 8))
medication_counts = df['Medication'].value_counts().nlargest(10)
medication_counts.plot(kind='barh', color='darkorange', edgecolor='black')
plt.title("Top 10 Medications by Prescription Count")
plt.xlabel("Count")
plt.ylabel("Medication")

# Add values over each bar
for index, value in enumerate(medication_counts):
    plt.text(value, index, str(value), ha='left', va='center', fontsize=10, color='black')

plt.show()


##### 1. Why did you pick the specific chart?

The specific chart chosen is a horizontal bar chart depicting the "Top 10 Medications by Prescription Count." Here's why this specific chart was selected:

Top-N Comparison: A horizontal bar chart is effective for comparing the counts of the top N medications with the highest prescription counts. It provides a clear visual representation of the most frequently prescribed medications.

Space-Efficient: A horizontal orientation is space-efficient when dealing with a limited number of categories (medications in this case). It allows for easy comparison of the top medications without cluttering the visualization.

Readability: Horizontal bar charts are often more readable when dealing with long category labels, such as medication names. The horizontal orientation allows for better visibility and avoids crowding of labels.

##### 2. What is/are the insight(s) found from the chart?

General insights that can be drawn from the "Top 10 Medications by Prescription Count" horizontal bar chart:

Most Prescribed Medications: Identify the medications with the highest prescription counts. These medications are likely to be commonly used and play a significant role in patient care within the dataset.

Prescription Popularity: The length of the bars in the chart provides a visual representation of the popularity of each medication in terms of prescription counts. Longer bars indicate higher prescription frequencies.

Clinical Preferences: The medications featured in the top positions may reflect clinical preferences, standard treatment protocols, or guidelines within the healthcare facility or dataset.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Efficient Supply Chain Management: Understanding the most prescribed medications allows for efficient supply chain management. Facilities can ensure an adequate stock of these medications, reducing the risk of shortages and ensuring a smooth workflow.

Cost Optimization: Knowledge of frequently prescribed medications presents opportunities for cost optimization. Negotiating favorable pricing, exploring bulk purchasing options, or considering generic alternatives can contribute to financial efficiency.

Quality of Care: Identifying commonly prescribed medications reflects their importance in patient care. Ensuring a consistent and reliable supply of these medications supports the quality of care provided by the healthcare facility.

Considerations for Negative Growth:

Dependency on a Few Medications: Overreliance on a small number of frequently prescribed medications may pose risks. Changes in availability, pricing, or guidelines for these medications could impact the facility's ability to provide certain treatments.

Cost Burden: If the most prescribed medications are high-cost, there may be a financial burden on the facility, particularly if these medications are not adequately reimbursed. This can affect overall financial health and sustainability.

Supply Chain Vulnerability: A concentration on a limited set of medications increases the vulnerability of the supply chain. Any disruptions, such as shortages or supply chain issues for these medications, could negatively impact patient care.
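
How dependent the facility is on a few medications can be quantified as the share of all prescriptions covered by the top-k medications. The following is a minimal sketch; the medication names and counts are hypothetical and stand in for `df['Medication']`.

```python
import pandas as pd

# Hypothetical prescriptions (an assumption, not the real data).
medication = pd.Series(
    ["Aspirin"] * 5 + ["Ibuprofen"] * 3 + ["Paracetamol"] * 2, name="Medication"
)

# Share of all prescriptions held by each medication, largest first.
share = medication.value_counts(normalize=True)

# Cumulative share: how much of the total the top-k medications cover.
cumulative_share = share.cumsum()
top2_share = cumulative_share.iloc[1]

print(cumulative_share)
```

A cumulative share that approaches 1 after only a few medications would support the dependency concern; a flatter curve would suggest prescriptions are spread more evenly.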

#### Chart - 11

In [None]:
# Chart - 11 Blood Type Distribution
plt.figure(figsize=(10, 6))
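blood_type_counts = df['Blood Type'].value_counts()
# Draw the bars in the same order as blood_type_counts so the labels added below match their bars
sns.countplot(x='Blood Type', data=df, order=blood_type_counts.index, palette="Set2")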

plt.title('Blood Type Distribution')
plt.xlabel('Blood Type')
plt.ylabel('Count')

# Add values over each bar
for index, value in enumerate(blood_type_counts):
    plt.text(index, value, str(value), ha='center', va='bottom', fontsize=10, color='black')

plt.show()


##### 1. Why did you pick the specific chart?

The specific chart chosen is a count plot depicting the distribution of blood types. Here's why this specific chart was selected:

Categorical Distribution: A count plot is suitable for visualizing the distribution of categorical data, such as blood types. It provides a quick and clear overview of the frequency of each category.

Frequency Comparison: The vertical bars in a count plot represent the frequency of each blood type, making it easy to compare the number of occurrences for different blood types within the dataset.

Readability: Count plots are straightforward and easy to interpret. They are particularly effective when dealing with a small number of categories (blood types in this case), allowing for clear visualization without excessive detail.

##### 2. What is/are the insight(s) found from the chart?

Prevalence of Blood Types: Identify the most and least prevalent blood types based on the height of the bars. This provides a quick overview of the distribution of blood types within the dataset.

Common Blood Types: If there are prominent peaks in the count plot, these represent the blood types that are most commonly found in the dataset. This information can be useful for healthcare planning and understanding the demographics of the population.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Potential Positive Impact:

Tailored Healthcare Services: Understanding the distribution of blood types allows healthcare providers to tailor their services to the specific needs of the population. For example, it can inform blood donation campaigns, organ transplantation planning, and other medical services.

Efficient Blood Inventory Management: For healthcare facilities involved in blood transfusions, knowledge of prevalent blood types helps in maintaining an efficient and well-balanced blood inventory. This can contribute to improved patient care and satisfaction.

Enhanced Emergency Preparedness: In emergency situations, having insights into the prevalence of different blood types can enhance emergency preparedness. It aids in ensuring an adequate supply of compatible blood for transfusions during critical situations.

Considerations for Negative Growth:

Imbalances in Blood Supply: If there are significant imbalances in the distribution of blood types, it may lead to challenges in maintaining a well-balanced blood supply. Shortages or surpluses of certain blood types could affect patient care and may lead to negative growth in service quality.

Challenges in Organ Transplantation: A skewed distribution may pose challenges in organ transplantation, especially if there is a shortage of donors with specific blood types. This could impact the success rates of transplantation procedures.

#### Chart - 12

In [None]:
#Test Results distribution
plt.figure(figsize=(12, 6))
sns.boxplot(x=df['Test Results'], color='lightblue')
plt.title('Distribution of Test Results')
plt.xlabel('Test Results')
plt.show()

##### 1. Why did you pick the specific chart?

Statistical Summary: Boxplots provide a concise summary of the statistical distribution of a continuous variable, including measures such as the median, quartiles, and potential outliers. This is valuable for understanding the central tendency and spread of test results.

Identification of Central Tendency: The central box in the plot represents the interquartile range (IQR) and the median, offering insights into the central tendency of the test results distribution. This aids in identifying the typical or median value of the test results.

Spread of Data: The length of the box and the whiskers indicate the spread or variability of the test results. This helps in assessing how widely the test results vary from the median and provides a sense of the overall distribution.

##### 2. What is/are the insight(s) found from the chart?

Central Tendency: Identify the median value (center of the box). This represents the middle value of the test results distribution. The position of the median can provide insight into the typical or central value of the test results.

Spread of Test Results: Examine the length of the box and the whiskers. A longer box or whiskers indicate a greater spread or variability in the test results. This information helps in understanding the range of values observed in the dataset.

Skewness: Assess the symmetry of the boxplot. If the median lies near the center of the box and the whiskers are of similar length, the data may be roughly symmetrical. If one half of the box or one whisker is noticeably longer than the other, it suggests skewness in the test results distribution.
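
The box and whiskers follow a simple rule that can be reproduced by hand. The following is a minimal sketch of the conventional 1.5 × IQR fences, assuming the test results are numeric; the `values` series is hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical numeric test values (an assumption, not the real data).
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], name="Test Results")

# The box spans the first to the third quartile.
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1

# Conventional fences: points beyond 1.5 * IQR from the box are drawn individually.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"fences=({lower_fence}, {upper_fence})  outliers={outliers.tolist()}")
```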

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Potential Positive Impact:

Identification of Normal Range: Understanding the central tendency and spread of test results helps define a normal or typical range. This information is crucial for healthcare professionals to establish benchmarks for healthy individuals and identify deviations from normal values.

Detection of Outliers: The identification of potential outliers in the test results can be valuable for healthcare providers. Outliers may indicate rare conditions, unusual responses to treatment, or errors in testing. Early detection and investigation of outliers contribute to improved patient care.

Considerations for Negative Impact:

Concerns with Outliers: While outliers can be informative, extreme values may lead to unnecessary investigations or interventions if not properly understood. Misinterpretation of outliers could result in additional costs and potential negative impacts on patient well-being.

Data Quality Issues: Skewed or unusual distributions may indicate underlying data quality issues, such as errors in measurement, sample contamination, or systematic biases. Addressing these issues is crucial to maintaining the integrity of healthcare data.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
# Room number distribution
plt.figure(figsize=(12, 8))
sns.boxplot(x='Room Number', data=df, palette='coolwarm')
plt.title('Box Plot of Room Number Distribution')
plt.xlabel('Room Number')
plt.show()

##### 1. Why did you pick the specific chart?

Variability Across Room Numbers: The length of the box and the whiskers illustrate the variability or spread of room numbers. A longer box or whiskers indicate greater variability, providing information about the range of room numbers within the dataset.

Outlier Detection: Box plots effectively highlight potential outliers, which are represented as individual points beyond the whiskers. Identifying outliers in room numbers may be valuable for addressing any anomalies or errors in the dataset.

Comparison Across Categories: If room numbers are categorized or have specific characteristics, a box plot allows for easy visual comparison of the distribution across these categories. It aids in understanding how room numbers vary based on different factors.

##### 2. What is/are the insight(s) found from the chart?

Central Tendency: Identify the median room number, which is represented by the line inside the box. This gives an indication of the central or typical room number in the dataset.

Spread of Room Numbers: Evaluate the length of the box and the whiskers. A longer box or whiskers indicate a greater spread or variability in room numbers. Understanding this spread is crucial for assessing the range of available room numbers.

Outlier Detection: Look for individual points beyond the whiskers. These points represent potential outliers—room numbers that deviate significantly from the majority. Investigate any outliers for potential data anomalies or specific room characteristics.

Skewness: Assess the symmetry of the box. If the median lies near the center of the box and the whiskers are of similar length, it suggests a more symmetrical distribution of room numbers. Asymmetry in the box or whiskers may indicate skewness in the data.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Potential Positive Impact:

Efficient Resource Allocation: Understanding the distribution of room numbers helps in efficiently allocating resources within a healthcare facility. It allows for better management of available rooms, ensuring optimal utilization based on demand.

Improved Patient Experience: Knowledge of the variability in room numbers can contribute to a more tailored patient experience. Facilities can better match room assignments to patient needs, preferences, or medical requirements, enhancing overall satisfaction.

Enhanced Facility Planning: Insights into the central tendency and spread of room numbers aid in facility planning. Facilities can plan expansions or renovations based on the distribution, addressing any limitations or accommodating specific room requirements.


Optimized Workflow: Understanding the distribution helps in optimizing operational workflows. It enables facilities to streamline processes related to room assignments, patient transfers, and overall logistical management.

Identification of Specialized Rooms: If room numbers are categorized, the analysis can reveal the distribution of specialized rooms (e.g., intensive care units, operating rooms). This information is crucial for planning and ensuring adequate resources for specialized care.

Considerations for Negative Impact:

Resource Imbalances: If the distribution of room numbers indicates imbalances, such as a shortage of specific room types or overcrowding in certain areas, it may lead to operational challenges and negatively impact patient care.

Patient Satisfaction Issues: Inconsistencies in room assignment or a skewed distribution may lead to disparities in patient experiences. Patients may perceive inequalities in the quality of rooms, potentially affecting overall satisfaction.

Operational Inefficiencies: Failure to address issues revealed by the distribution, such as outliers or skewed patterns, can result in operational inefficiencies. For example, difficulties in managing outlier room numbers may lead to increased workload for staff.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(12, 10))
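# numeric_only=True skips non-numeric columns, which would otherwise raise an error in pandas 2.0 and later
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt=".2f")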
plt.title('Correlation Matrix Heatmap')
plt.show()

##### 1. Why did you pick the specific chart?

Multivariate Analysis: A heatmap of the correlation matrix is particularly useful for multivariate analysis. It allows for the simultaneous examination of relationships between multiple variables in a compact and visually accessible format.

Correlation Visualization: The heatmap color-codes the correlation coefficients, making it easy to identify the strength and direction of relationships between pairs of variables. This is valuable for understanding how variables co-vary.

Comprehensive Overview: The heatmap provides a comprehensive overview of the entire correlation matrix. Each cell represents the correlation coefficient between two variables, allowing for quick identification of patterns and associations.

##### 2. What is/are the insight(s) found from the chart?

Strength and Direction of Correlations: The color-coded cells in the heatmap show the strength and direction of correlations between pairs of variables. With the `coolwarm` colormap used here, saturated colors (deep red or deep blue) indicate stronger correlations, while pale colors suggest weaker or negligible correlations.

Identifying Positive and Negative Correlations: Focus on the signs of the correlation coefficients. Positive correlations appear in warm (red) tones, while negative correlations appear in cool (blue) tones. This helps identify variables that move together or in opposite directions.

Clusters of Correlated Variables: Look for clusters of strongly colored or pale cells, indicating groups of variables that are closely related or unrelated. Identifying these clusters can reveal underlying patterns or relationships within the dataset.
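
Reading the strongest relationships off a heatmap by eye gets harder as the number of columns grows, so the correlation matrix can also be ranked directly. The following is a minimal sketch on a synthetic frame; the column names `x`, `y`, `z`, and `label` are placeholders, not columns from the project dataset.

```python
import numpy as np
import pandas as pd

# Synthetic frame (an assumption, not the real data).
rng = np.random.default_rng(2)
x = rng.normal(size=500)
frame = pd.DataFrame({
    "x": x,
    "y": 2 * x + rng.normal(scale=0.1, size=500),  # strongly tied to x
    "z": rng.normal(size=500),                      # independent of both
    "label": ["a", "b"] * 250,                      # non-numeric, excluded below
})

corr = frame.corr(numeric_only=True)

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Rank pairs by absolute correlation, strongest first.
ranked = pairs.sort_values(key=np.abs, ascending=False)

print(ranked)
```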

#### Chart - 15 - Pair Plot

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming 'df' is your DataFrame and you want to include specific numerical columns
numerical_columns = ['Age', 'Billing Amount', 'Room Number', 'Test Results']

# Set style and palette
sns.set(style="whitegrid", palette="viridis")

# Create pair plot
sns.pairplot(df[numerical_columns], hue=None)
plt.suptitle("Pair Plot of Selected Numerical Columns", y=1.02)
plt.show()


##### 1. Why did you pick the specific chart?

Multivariate Analysis: A pair plot is a powerful tool for multivariate analysis, allowing you to examine relationships between multiple numerical variables simultaneously. Each scatter plot in the pair plot represents the relationship between two variables.

Visualizing Relationships: Pair plots visualize the relationships between all possible pairs of numerical columns in a concise and systematic manner. This is valuable for exploring patterns, trends, and potential correlations in the data.

Identification of Trends and Outliers: The scatter plots in the pair plot allow for the identification of trends and patterns in the relationships between variables. Additionally, outliers or unusual data points can be visually identified.

##### 2. What is/are the insight(s) found from the chart?

General insights that can be drawn from a pair plot of selected numerical columns:

Correlation Patterns: Check for patterns in the scatter plots. If points generally follow a trend, it suggests a correlation between the two variables. For example, a positive correlation would result in points sloping upwards.

Strength of Relationships: The tightness and direction of the scatter plot patterns indicate the strength and nature of relationships between pairs of variables. Closer and more defined patterns suggest stronger relationships.

Identifying Outliers: Look for individual data points that stand out from the general pattern in the scatter plots. Outliers may indicate unusual observations or errors in the data that warrant further investigation.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
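
No hypothesis has been stated for this section yet, so the following only illustrates the workflow. It is a minimal sketch of a chi-square test of independence between two categorical columns, using a hypothetical null hypothesis ("blood type is independent of gender") and synthetic data. The column values and the 0.05 significance level are assumptions.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Synthetic categorical sample (an assumption, not the real data).
rng = np.random.default_rng(3)
blood_types = ["A+", "A-", "B+", "B-", "AB+", "AB-", "O+", "O-"]
sample = pd.DataFrame({
    "Gender": rng.choice(["Female", "Male"], size=1_000),
    "Blood Type": rng.choice(blood_types, size=1_000),
})

# Observed counts for every (Gender, Blood Type) combination.
contingency = pd.crosstab(sample["Gender"], sample["Blood Type"])

# H0: the two variables are independent. H1: they are associated.
chi2, p_value, dof, expected = chi2_contingency(contingency)

alpha = 0.05  # illustrative significance level
reject_null = p_value < alpha
print(f"chi2={chi2:.2f}  dof={dof}  p={p_value:.3f}  reject H0: {reject_null}")
```

A chi-square test suits two categorical variables; a comparison of a numeric column across groups would call for a different test, such as a t-test or ANOVA.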

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
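
As one possible starting point, nominal columns can be one-hot encoded with `pandas.get_dummies`. This is a minimal sketch on a tiny hypothetical frame; the rows and the choice of `drop_first=True` are illustrative assumptions, not the encoding used in this project.

```python
import pandas as pd

# Tiny hypothetical frame (an assumption, not the real data).
patients = pd.DataFrame({
    "Age": [34, 58, 41],
    "Gender": ["Male", "Female", "Male"],
    "Blood Type": ["A+", "O-", "B+"],
})

# One-hot encode the nominal columns; drop_first avoids a redundant column per feature.
encoded = pd.get_dummies(
    patients, columns=["Gender", "Blood Type"], drop_first=True, dtype=int
)

print(encoded)
```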

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words containing digits

In [None]:
# Remove URLs & Remove words containing digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used and why?

Answer Here.

##### Which features did you find important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
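
For a classification task, a stratified split keeps the class proportions the same in the train and test sets. The following is a minimal sketch with scikit-learn on synthetic data; the 80/20 ratio, the 70/30 class mix, and `random_state=42` are illustrative assumptions, not choices made in this project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and labels (an assumption, not the real data).
rng = np.random.default_rng(4)
X = rng.normal(size=(1_000, 5))
y = np.array([0] * 700 + [1] * 300)  # 70/30 class mix

# stratify=y preserves the class mix in both splits; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape, y_train.mean(), y_test.mean())
```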

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalanced dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation Metric Score Chart.

Answer Here.

#### 3. Explain what each evaluation metric indicates for the business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using the Evaluation Metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (e.g., GridSearchCV, RandomizedSearchCV, Bayesian optimization, etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation Metric Score Chart.

Answer Here.

### 1. Which evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the models created above as your final prediction model and why?

Answer Here.

### 3. Explain the model you have used and its feature importance using any model explainability tool.

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best-performing ML model in a pickle or joblib file format for the deployment process.


In [None]:
# Save the File

### 2. Load the saved model file again and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
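
The save and load steps above can be sketched together as a round trip with the standard-library `pickle` module. The "model" here is a hypothetical stand-in (a dictionary of learned parameters with a small `predict` helper), and the file name `best_model.pkl` is an assumption. A fitted scikit-learn estimator can be dumped and loaded the same way. Note that unpickling executes code from the file, so only load model files from a trusted source.

```python
import os
import pickle
import tempfile


def predict(model, rows):
    """Linear decision rule: predict 1 when the weighted sum plus bias is positive."""
    return [
        int(sum(w * v for w, v in zip(model["weights"], row)) + model["bias"] > 0)
        for row in rows
    ]


# Hypothetical stand-in for the best-performing trained model.
model = {"weights": [0.8, -0.5], "bias": -0.1}
# Hypothetical unseen rows for the sanity check.
unseen = [[1.0, 0.2], [0.1, 0.9]]

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "best_model.pkl")

    # Save the model to disk.
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

    # Load it back, as a deployment process would.
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

# Sanity check: the restored model must reproduce the original predictions.
predictions_before = predict(model, unseen)
predictions_after = predict(restored, unseen)
```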

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***