# Machine Learning Algorithms Basics

## Linear Regression
A model that generates a continuous relationship between independent and dependent variables.

**Goal: find a model that has the least mean square error**

Using the scikit learn package, import the diabetes dataset and use the linear regression model

In [45]:
# import packages
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

import numpy as np

In terms of metrics, mean squared error (MSE) tells us about the distance between the predicted and actual values. The r^2 value tells us if the model can predict a relationship between independent and dependent variables: 0 being no relationship, 1 being 100% relationship prediction. Mean absolute error is the sum of errors is delta between estimated and measured values divided by number of observations.

### Load the dataset

In [46]:
# load dataset
diabetesData = load_diabetes()
x = diabetesData.data
y = diabetesData.target # target means dependent variable in sci-kit learn (linear algebra term)

### Split the x & y data into training and testing data

In [47]:
# set the testing/training data split using 20/80 split, set the random state seed to 42 (used to replicate results later on)
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size=0.35, random_state=42) 


### Train the regression model

In [48]:
# train the linear regression model
linRegModel = LinearRegression()
# fit the training data to the model
linRegModel.fit(xTrain, yTrain)

# Compute the target predictions
yPreds = linRegModel.predict(xTest)

# Evaluate how the model performed
meanSquaredError = mean_squared_error(yTest, yPreds)
r2 = r2_score(yTest, yPreds)
meanAbsError = mean_absolute_error(yTest, yPreds)

# Output the results
print("Mean Squared Error is: " + str(meanSquaredError))
print("R Square Score is: " + str(r2))
print("Mean Abosolute Error is: " + str(mean_absolute_error))

del(x, y, xTrain, yTrain, xTest, yTest, yPreds)

Mean Squared Error is: 2793.8561661483113
R Square Score is: 0.5139510592844927
Mean Abosolute Error is: <function mean_absolute_error at 0x135007060>


Output results indicate that the model is only 45% correct in its predictions compared to the acutal values, and that the model is 2,900 units values away from the true values. Adjusting testing/training values produces better results. 

## Logistic Regression
A group of models that are generally used in classification problems, using probabilities to classify if something belongs to a certain class.

**Metrics:**
1. Accuracy
2. Precision & Recall: ratio of positive predictions to true positives, recall being positive predictions to all predictions in a class
3. F1 Score: Equilibrium between precision and recall

In [41]:
# import the packages, select the dataset
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


In [49]:
# Load the dataset
breastCancerData = load_breast_cancer()
x = breastCancerData.data
y = breastCancerData.target

In [52]:
# Split the data into training and testing splits
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size=0.25, random_state=42)

In [53]:
# Create and train the logistic regression model
logRegModel = LogisticRegression(max_iter=10000)
logRegModel.fit(xTrain, yTrain)


In [54]:
# Predict test set results
yPreds = logRegModel.predict(xTest)


In [56]:
# Evaluate the model, print results
accuracy = accuracy_score(yTest, yPreds)
precision = precision_score(yTest, yPreds)
recall = recall_score(yTest, yPreds)
f1 = f1_score(yTest, yPreds)

# Print the results
print("Accuracy: ", accuracy)
print("Precision: ", precision)
print("Recall: ", recall)
print("F1 Score: ", f1)

Accuracy:  0.965034965034965
Precision:  0.9666666666666667
Recall:  0.9775280898876404
F1 Score:  0.9720670391061452


Results are not bad, these algorithms do really well given these circumstances