---------------------------


# Performance Assessment: D208 Predictive Modeling Task 1 - Multiple Linear Regression

## Michael Hindes
Department of Information Technology, Western Governors University
<br>D208: Predictive Modeling
<br>Professor David Gagner
<br>February 11, 2024

This project aims to create a multiple regression model derived from medical raw data, targeting a business question reflective of a real-world organizational challenge. Python is employed to conduct a multiple regression analysis to explore the research question thoroughly. The analysis is supported by visual aids to elucidate the regression outcomes and predictions. The process also involves meticulous data cleaning to ensure accuracy and reliability. Additionally, the project shares the code used for the regression analysis and predictions. It concludes by detailing the regression equation, evaluating the statistical and practical significances, discussing limitations, and suggesting possible actions.


# Part I: Research Question
## Describe the purpose of this data analysis by doing the following:

### **A1. Research Question:**
**"What factors contribute to the total charges incurred by patients during their hospital stay?"**

This question aims to identify key variables within the dataset that influence hospital charges, including length of stay, services rendered, patient risk factors, and demographic details. The goal is to understand the primary drivers of hospital expenses. This information can be used to help predict charges for future patients, allowing hospitals to better manage their resources and improve patient care.

### **A2. Define the goals of the data analysis.**

This data analysis project is focused on developing a predictive model as a practical 
tool to help healthcare organizations in planning and operational improvements. By 
examining a wide range of factors that potentially affect TotalCharge, the project 
aims to build a model that supports data-driven decision-making in healthcare. This 
initiative represents a preliminary step towards leveraging predictive modeling for 
financial sustainability and greater transparency.

-   Variable Identification: Identify a comprehensive set of factors that influence 
TotalCharge, with a focus on clinical, operational, and demographic elements. This 
step lays the groundwork for understanding the broad variables that could impact 
hospital charges.

-   Quantitative Assessment: Conduct a quantitative analysis to evaluate how these 
factors contribute to TotalCharge. This will help in understanding the significance 
and relationships of different variables with TotalCharge, providing a basis for the 
predictive model.

-   Insight Generation: The aim is to generate preliminary insights that could inform 
hospital cost management and pricing strategies, potentially leading to improved 
operational and billing processes. These insights are seen as an initial foray into 
optimizing hospital operations.

-   Predictive Modeling: The core goal is to develop an initial predictive model that 
estimates TotalCharge based on factors identifiable prior to or at the point of 
admission. This model is intended to enhance financial planning and increase 
transparency for both the hospital and its patients, serving as a first step towards 
more sophisticated predictive capabilities in the future.



-------------------------------------


# Part II: Method Justification

## B. Describe multiple linear regression methods by doing the following:

### **B1. Summarize four assumptions of a multiple linear regression model:**

In multiple linear regression analysis, four key assumptions are critical: linearity between variables, independence of observations, constant error variance (homoscedasticity), and normal distribution of error terms. Understanding and checking these assumptions is essential for the model's reliability and accuracy, providing a solid basis for predictive analytics.

-   Linearity asserts that there is a straight-line relationship between each predictor (independent variable) and the response (dependent variable). This means that changes in a predictor variable are associated with proportional changes in the response variable.

-   Independence of Observations indicates that the data points in the dataset do not influence each other. Each observation's response is determined by its predictor values, free from the effects of other observations in the dataset.

-   Homoscedasticity refers to the requirement that the error terms (differences between observed and predicted values) maintain a consistent variance across all levels of the independent variables. This constant variance ensures that the model's accuracy does not depend on the value of the predictors.

-   Normality of Errors involves the assumption that for any fixed value of an independent variable, the error terms are normally distributed. This normal distribution is central to conducting various statistical tests on the model's coefficients to determine their significance.

(Pennsylvania State University, n.d.; Statology, n.d.)
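
The four assumptions can be spot-checked numerically from a fitted model's residuals. The sketch below is only illustrative: it uses synthetic data rather than the medical dataset, the helper name `residual_diagnostics` is hypothetical, and the summary numbers are rough screening statistics, not formal hypothesis tests. Residual plots remain the more informative check in practice.

```python
import numpy as np

# Synthetic data (hypothetical): two predictors and a linear response with normal noise.
rng = np.random.default_rng(42)
n = 500
X = rng.normal(size=(n, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=1.0, size=n)

# Ordinary least squares fit: prepend an intercept column and solve.
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ coef
resid = y - fitted


def residual_diagnostics(fitted, resid):
    """Rough numeric screens for the four assumptions (hypothetical helper)."""
    z = (resid - resid.mean()) / resid.std()
    return {
        # Linearity: residuals should not track the curvature of the fitted values.
        "curvature_corr": np.corrcoef(resid, (fitted - fitted.mean()) ** 2)[0, 1],
        # Homoscedasticity: residual spread should not grow with the fitted values.
        "spread_corr": np.corrcoef(np.abs(resid), fitted)[0, 1],
        # Independence: Durbin-Watson statistic; values near 2 suggest no
        # first-order autocorrelation (meaningful only if row order matters).
        "durbin_watson": np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2),
        # Normality: skewness and excess kurtosis are both near 0 for normal errors.
        "skewness": np.mean(z ** 3),
        "excess_kurtosis": np.mean(z ** 4) - 3.0,
    }


diag = residual_diagnostics(fitted, resid)
for name, value in diag.items():
    print(f"{name}: {value:.3f}")
```

Because the synthetic data satisfy all four assumptions by construction, the correlations, skewness, and excess kurtosis should land near 0 and the Durbin-Watson statistic near 2. Large departures on real data would be a prompt to inspect the residual plots more closely.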

### **B2. Describe two benefits of using Python for data analysis:**

- **Rich Library Ecosystem:** Python offers comprehensive libraries such as Pandas for data manipulation, NumPy for numerical computations, Matplotlib and Seaborn for visualization, and Scikit-learn for machine learning, facilitating a wide range of data analysis tasks.
- **Versatility and Community Support:** Python's syntax is intuitive and readable, making it accessible for beginners and versatile for various tasks beyond data analysis. The extensive community support ensures abundant resources for troubleshooting and learning.

### **B3. Explain why multiple linear regression is an appropriate technique for analyzing the research question summarized in part I:**

Multiple linear regression is apt for exploring our research question because it enables the identification and quantification of relationships between a continuous outcome variable (Total Charges) and multiple predictor variables. This method allows for the analysis of how individual factors, such as length of hospital stay, patient demographics, and received services, collectively influence the total hospital charges, providing insights essential for predictive modeling and decision-making in healthcare.
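
For reference, the general form of the model described above can be written as follows. The symbols ($\beta_j$, $x_j$, $\varepsilon$, and the number of predictors $k$) are generic textbook notation, not names taken from the dataset.

$$
\text{TotalCharge} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon
$$

Here each $\beta_j$ is the expected change in TotalCharge for a one-unit increase in $x_j$ with the other predictors held constant, $\beta_0$ is the intercept, and $\varepsilon$ is the error term.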


# Part III: Data Preparation

## C. Summarize the data preparation process for multiple linear regression analysis by doing the following:

### **C1. Describe your data cleaning goals and the steps used to clean the data to achieve the goals that align with your research question, including your annotated code.**

*   **Importing the Data**: Use `pd.read_csv()` to import the data into a Pandas DataFrame.
    
*   **Initial Data Examination**: Using `df.head()` provides a quick snapshot of the dataset, including a view of the first few rows. This helps in getting a preliminary understanding of the data's structure and content.
    
*   **Checking Data Types**: The `df.info()` method is used to assess the dataset's overall structure, including the data type of each column and the number of non-null values it contains.
    
*   **Identifying Duplicate Rows**: `df.duplicated()` flags duplicate rows, which is an essential cleaning step because duplicates can skew the analysis and lead to inaccurate models. Once identified, duplicates can be removed with `df.drop_duplicates()` if they are not relevant to the research question.
    
*   **Detecting Missing Values**: The `df.isnull().sum()` command is instrumental in identifying missing values across the dataset. Understanding where and how much data is missing is critical for deciding on imputation methods or if certain rows/columns should be excluded from the analysis.



*   **Reviewing Unique Values**: `unique()` is a Series method, so `df['column_name'].unique()` lists the unique values in a specific column, while `df.nunique()` reports the number of unique values in each column of the DataFrame. This step is valuable for understanding the diversity of information within the dataset, particularly for categorical data.
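
The cleaning steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the rows are made up, and every column name other than `TotalCharge` is an assumption. With the real file, the `io.StringIO` stand-in would be replaced by the path to the CSV.

```python
import io

import pandas as pd

# Tiny stand-in for the medical CSV (hypothetical rows and column names,
# apart from TotalCharge, which the report names as the response variable).
csv_text = """Age,Gender,Initial_days,TotalCharge
53,Male,10.6,3726.70
51,Female,15.1,4193.19
53,Male,10.6,3726.70
78,Female,,2434.23
22,Female,1.3,
"""

# Import the data into a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# Preview the first rows, then inspect dtypes and non-null counts.
print(df.head())
df.info()

# Count fully duplicated rows, then drop them (the first occurrence is kept).
n_duplicates = int(df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)

# Count missing values per column.
missing_counts = df.isnull().sum()

# Number of distinct values per column, and the distinct values of one column.
unique_counts = df.nunique()
gender_values = sorted(df["Gender"].unique())

print(f"Duplicate rows removed: {n_duplicates}")
print(missing_counts)
print(unique_counts)
print(gender_values)
```

The missing-value counts only locate the gaps. Whether to impute or exclude the affected rows is a separate decision that depends on the research question.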


follow the slides here: https://westerngovernorsuniversity.sharepoint.com/:p:/r/sites/DataScienceTeam/_layouts/15/Doc.aspx?sourcedoc=%7B285C378F-8089-4758-9ABE-29976D079B56%7D&file=Dr.%20Sewell%20D208_Predictive_Modeling_Webinar_Episode%201t.pptx&action=edit&mobileredirect=true


-------------------------------------



### G & H: References

- Western Governors University. (2023, December 21). D207 - Medical_clean Dataset. Retrieved from https://lrps.wgu.edu/provision/227079957

- Western Governors University IT Department. (2023). R or Python? How to decide which programming language to learn. Retrieved from https://www.wgu.edu/online-it-degrees/programming-languages/r-or-python.html#

- Datacamp. (2023, December 12). D207 - Exploratory Data Analysis. Retrieved from https://app.datacamp.com/learn/custom-tracks/custom-d207-exploratory-data-analysis 

- Sewell, Dr. (2023). WGU D207 Exploratory Data Analysis [Webinars]. WGU Webex. Accessed December, 2023. https://wgu.webex.com/webappng/sites/wgu/meeting/info/c4aca2eac546482880f1557c938abf40?siteurl=wgu&MTID=me73470c2eac9e863c6f47a3d5b6d2f26 

- Seaborn Developers. (2023). seaborn.scatterplot — seaborn 0.11.2 documentation. Retrieved December 22, 2023, from https://seaborn.pydata.org/generated/seaborn.scatterplot.html


- Statology. (n.d.). *The Five Assumptions of Multiple Linear Regression*. Statology. Retrieved March 10, 2024, from www.statology.org/multiple-linear-regression-assumptions/

- Pennsylvania State University. (n.d.). *5.3 - The Multiple Linear Regression Model*. STAT 501. Retrieved March 10, 2024, from online.stat.psu.edu/stat501/lesson/5/5.3



# Limitations

Keep the following limitations in mind when interpreting the regression analysis:

-   Overfitting can occur when the number of data points is limited.

-   Multicollinearity occurs when an independent variable (IV) is highly correlated with other IVs. When it is present, p-values can be unreliable and coefficient estimates can swing wildly. Check for it with pairwise correlations and a high variance inflation factor (VIF > 10).

-   Tune the model with as many variables as is practical, using forward, backward, or stepwise regression based on criteria such as AIC or BIC.
ppoint 5 https://westerngovernorsuniversity-my.sharepoint.com/:p:/g/personal/william_sewell_wgu_edu/ERPQ0YpiQktOl-7YyAVnfLMBR5qeBh2cSv61VaJqe_aHKg?e=FjPhPz
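
The pairwise-correlation and VIF checks mentioned above can be sketched as follows. This is an illustration under stated assumptions: the data are synthetic, the column names `x1`, `x2`, and `x3` are made up, and `vif_table` is a hypothetical helper that computes VIF directly from its definition. The statsmodels library also provides a `variance_inflation_factor` function.

```python
import numpy as np
import pandas as pd


def vif_table(predictors: pd.DataFrame) -> pd.Series:
    """Compute VIF_j = 1 / (1 - R_j^2) for each column, where R_j^2 comes from
    regressing column j on all other columns plus an intercept (hypothetical helper)."""
    vifs = {}
    for col in predictors.columns:
        target = predictors[col].to_numpy(dtype=float)
        others = predictors.drop(columns=col).to_numpy(dtype=float)
        A = np.column_stack([np.ones(len(predictors)), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r_squared = 1.0 - resid.var() / target.var()
        vifs[col] = 1.0 / (1.0 - r_squared)
    return pd.Series(vifs, name="VIF")


# Synthetic predictors (hypothetical): x3 is nearly a copy of x1, x2 is unrelated.
rng = np.random.default_rng(0)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + 0.1 * rng.normal(size=n)
predictors = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# Pairwise correlations flag the x1/x3 pair directly.
corr = predictors.corr()
print(corr.round(3))

# VIF flags any predictor that the remaining predictors can reconstruct.
vifs = vif_table(predictors)
print(vifs.round(2))
print("Above the VIF > 10 threshold:", list(vifs[vifs > 10].index))
```

In this constructed example `x1` and `x3` exceed the threshold while `x2` does not. One common response is to drop one variable from each collinear group and recompute the VIFs.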
