<a href="https://colab.research.google.com/github/uofldmlab/IntroDMLab/blob/main/demo_lab_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Click the "Open In Colab" badge at the top of the page to launch the lab in Google Colab.

The following link provides a brief tutorial on how to use Google Colab: https://www.youtube.com/watch?v=inN8seMm7UI&ab_channel=TensorFlow

# Lab #02: Linear Regression & K-Means

# Import Python Libraries

Importing Python libraries extends the data types and functions available beyond the core Python language. Pandas is widely used to open, view, manipulate, and analyze data. Scikit-learn (imported as sklearn) provides machine learning algorithms such as Linear Regression and K-Means. Matplotlib is used to plot data visualizations.

In [57]:
import pandas as pd #pandas for dataframes
from sklearn.linear_model import LinearRegression #sklearn.linear_model for the Linear Regression model 
import sklearn.metrics as metrics #to measure model performance
import matplotlib.pyplot as plt #for plotting
from sklearn.cluster import KMeans #for kmeans algorithm

plt.rcParams['figure.figsize'] = [15,8] #defining plot size

# Open Data
We will use the pandas read_csv() function to import a CSV file from a URL and store the data into a Pandas dataframe.

In this case we are reading a comma-delimited text file (.csv) from: https://raw.githubusercontent.com/uofldmlab/IntroDMLab/main/insurance.csv

This file contains data on medical charges. The dataset has 7 variables: 6 independent (age, BMI, sex, smoker, children, region) and 1 dependent (charges).

In [3]:
df = pd.read_csv("https://raw.githubusercontent.com/uofldmlab/IntroDMLab/main/insurance.csv")

In [None]:
df.head()
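
To see what read_csv() produces without depending on a network connection, the sketch below parses a few rows from an in-memory string. The rows are made up for illustration, and the column order is an assumption based on the variables listed above; only the column names come from this lab.

```python
import io
import pandas as pd

# Made-up rows using the lab's column names (not real data; column order is assumed).
csv_text = """age,sex,bmi,children,smoker,region,charges
25,female,22.5,0,no,northwest,3200.50
40,male,30.1,2,yes,southeast,21500.00
58,female,27.8,1,no,northeast,11800.75
"""

# read_csv accepts file-like objects as well as file paths and URLs
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)   # (number of rows, number of columns)
print(sample_df.dtypes)  # data type pandas inferred for each column
```

The same two checks (`shape` and `dtypes`) can be run on `df` after loading the real file.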

# Select X & Y
Suppose we want to use a person's age to predict their medical charges, under the general assumption that a person's medical care costs increase with age. The "age" column will be the X variable in the linear regression equation, and the "charges" column will be the Y variable. Note that X, selected with double brackets, is a pandas DataFrame (2-D), which is the shape scikit-learn expects for features, while Y, selected with single brackets, is a pandas Series (1-D).

In [9]:
X = df[['age']]
Y = df['charges']
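
The difference between double and single brackets can be checked on a small frame. This is a minimal sketch with made-up values; the names `toy`, `X_toy`, and `Y_toy` are hypothetical and not part of the lab.

```python
import pandas as pd

# Made-up values, for illustration only
toy = pd.DataFrame({"age": [19, 33, 52], "charges": [1700.0, 4500.0, 11000.0]})

X_toy = toy[["age"]]    # double brackets: a 2-D DataFrame with one column
Y_toy = toy["charges"]  # single brackets: a 1-D Series

print(type(X_toy).__name__, X_toy.shape)  # DataFrame (3, 1)
print(type(Y_toy).__name__, Y_toy.shape)  # Series (3,)
```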

# Build Model
We use the X and Y variables from above to fit a linear regression model. We can then compare the model's predicted medical charges against the actual values.

In [12]:
linear_regressor = LinearRegression() #linear regression model
linear_regressor.fit(X, Y) #fit model
Y_pred = linear_regressor.predict(X) #make predictions

# Plot Equation

In [None]:
plt.scatter(X, Y) #plot points
plt.plot(X, Y_pred, color='red') #plot regression line
plt.show() #display plot

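# Mean Square Error
The mean squared error (MSE) is the average of the squared errors, where each error is the difference between an actual value and the model's prediction. It is a common metric for evaluating the performance of a regression model, and lower values indicate a closer fit.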

In [None]:
metrics.mean_squared_error(Y, Y_pred)
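
To make the definition concrete, the sketch below computes the MSE by hand on three made-up values and compares it with scikit-learn's result. The arrays `y_true` and `y_hat` are hypothetical and unrelated to the insurance data.

```python
import numpy as np
import sklearn.metrics as metrics

y_true = np.array([3.0, 5.0, 7.0])  # made-up actual values
y_hat = np.array([2.0, 5.0, 9.0])   # made-up predictions

errors = y_true - y_hat             # [1, 0, -2]
mse_manual = np.mean(errors ** 2)   # (1 + 0 + 4) / 3
mse_sklearn = metrics.mean_squared_error(y_true, y_hat)

print(mse_manual, mse_sklearn)      # the two values should agree
```

Because the errors are squared, the MSE is expressed in squared units of Y, so large individual errors weigh heavily on the result.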

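# Equation
To get the linear regression equation, y = intercept + slope * X, we read the *intercept_* and *coef_* attributes of the fitted linear regression model. Note that *coef_* is an array with one entry per feature, so with a single X column the slope is its first element.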

In [None]:
print('intercept:', linear_regressor.intercept_)
print('slope:', linear_regressor.coef_)
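
One way to check that these two attributes really define the fitted line is to plug a value into the equation by hand and compare it with predict(). This is a minimal sketch on made-up points that lie exactly on y = 100 + 50x; the names `x_demo`, `y_demo`, and `model` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up points that lie exactly on y = 100 + 50 * x
x_demo = np.array([[20], [30], [40], [50]])
y_demo = 100 + 50 * x_demo.ravel()

model = LinearRegression().fit(x_demo, y_demo)

intercept = model.intercept_  # a single number
slope = model.coef_[0]        # coef_ holds one coefficient per feature

manual = intercept + slope * 35                  # apply the equation by hand
from_model = model.predict(np.array([[35]]))[0]  # ask the model for the same input

print(manual, from_model)  # the two values should agree
```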


# K-Means Clustering
K-Means clustering is a way to mathematically segment data into groups, where "K" is the number of groups. Each point is assigned to the group whose center (centroid) is nearest to it.
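
The basic idea can be sketched as a loop that alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points. The function below is a teaching sketch written for this explanation, not scikit-learn's implementation; `simple_kmeans` and the sample points are hypothetical, and it uses plain random initialization rather than k-means++.

```python
import numpy as np

def simple_kmeans(points, k, n_iter=10, seed=0):
    """Plain assign/update loop; a teaching sketch, not sklearn's KMeans."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: label each point with the index of its nearest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of the points assigned to it
        for j in range(k):
            if np.any(labels == j):  # an empty cluster keeps its old centroid
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated made-up groups
pts = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
                [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])
labels, centroids = simple_kmeans(pts, k=2)
print(labels)
print(centroids)
```

Which group receives label 0 and which receives label 1 depends on the random starting points, so cluster numbers carry no meaning beyond group membership.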

# Selecting Fields
In this case we want to use the age and BMI columns to segment the data.

In [None]:
df_numeric = df[['age','bmi']]
df_numeric.head()

# K=3

We use the KMeans class to build K clusters (3 in this example). We can then visualize the data on a scatter plot, coloring the points by cluster and marking the centroid of each cluster in red. The *cluster_centers_* attribute of the fitted KMeans model provides the coordinates of each cluster's centroid, with one row per cluster and one column per input field.

In [None]:
k = 3
kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10, random_state=0)
pred_y = kmeans.fit_predict(df_numeric)
plt.scatter(df_numeric['age'], df_numeric['bmi'], c= kmeans.labels_.astype(float))
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=50, c='red')
plt.show()
print(kmeans.cluster_centers_)
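
The printed centroid array has no column labels, so it can help to wrap it in a DataFrame. The sketch below does this on made-up age/BMI pairs; the names `toy`, `km`, and `centers` are hypothetical, and the values are not taken from the insurance data.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up age/BMI pairs forming three loose groups
toy = pd.DataFrame({
    "age": [20, 22, 21, 40, 42, 41, 60, 62, 61],
    "bmi": [22.0, 23.0, 21.5, 30.0, 31.0, 29.5, 26.0, 27.0, 25.5],
})

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(toy)

# Columns of cluster_centers_ follow the column order of the fitted data
centers = pd.DataFrame(km.cluster_centers_, columns=toy.columns)
print(centers)

# Number of points assigned to each cluster
counts = pd.Series(km.labels_).value_counts().sort_index()
print(counts)
```

The same wrapping applies to the lab's model: `pd.DataFrame(kmeans.cluster_centers_, columns=df_numeric.columns)`.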

# K=5

In [None]:
k = 5
kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10, random_state=0)
pred_y = kmeans.fit_predict(df_numeric)
plt.scatter(df_numeric['age'], df_numeric['bmi'], c= kmeans.labels_.astype(float))
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=50, c='red')
plt.show()
print(kmeans.cluster_centers_)

# Lab Task #1

Build a linear regression model that predicts charges given BMI.

Q1. What is the linear regression equation where X is BMI and Y is the charges?


Q2. What is the MSE of the linear regression model where X is BMI and Y is the charges?



# Lab Task #2
Build a K-Means clustering model (K=3) using BMI and charges.

Q3. What are the centroids of the K-Means clusters (K=3) for BMI and charges?

**Email your responses to Questions #1-#3 to your instructor.**