<H1><center>Project 3<BR><BR>
Linear Regression and Logistic Regression</center></H1>

<H2>Task 1: Ethics of Artificial Intelligence
</H2>

<P>Read <a href="https://www.wired.com/story/artificial-intelligence-seeks-an-ethical-conscience/">this article</a> about ethical considerations in artificial intelligence.</P>

<P>Please address (minimum 200 words) the following questions in the space below. As the article describes, "many machine-learning systems are now essentially black boxes; their creators know they work, but they can't explain exactly why they make particular decisions." In other words, they are not <em>interpretable</em>. Consider many of the machine learning applications mentioned by the article, e.g., image processing, criminal justice, hiring, healthcare, military. Are there machine learning applications where you would sacrifice accuracy for interpretability? Are there machine learning applications where you would sacrifice interpretability for accuracy? The article concludes with the quote "Are there some things that we just shouldn't build?" What is your response to this question?</P>

In processes that inherently rely on the empathy and interpersonal connection of human interaction (for example, hiring), I believe we would benefit from prioritizing interpretability over accuracy. As the article suggests, these are the applications where a "black box" account of how a machine learning system reaches its decisions is not enough to support equitable and informed outcomes. Even in cases like healthcare, I think an occasional sacrifice of accuracy is acceptable, since such fields ought to depend on human healthcare professionals who use machine learning as a tool for, not a replacement of, their services. Society becomes a lot less human when we replace interaction with a machine learning system that is not genuinely understood. Just as we value rational actors in society, and as a species have devoted centuries to understanding human behavior, so should we study, understand, and (a rare new opportunity) improve technological behavior. A key issue here is the mitigation of harms, especially to the least well off. Under this framework, I believe there ought to be restrictions on what we build, especially for systems that serve only to harm and do not support sustainable human activity.

<H2>Task 2: Linear Regression
</H2>

<P>In this task, you will use linear regression to predict people's medical costs. Recall that linear regression is often used to predict <em>real</em>-valued results whereas logistic regression is often used to predict <em>classes</em>.</P>

<P>The data for this task <a href="https://www.kaggle.com/mirichoi0218/insurance">comes from Kaggle</a> and can be found in the CSV file <code>insurance.csv</code> that you downloaded as part of this project. For each of 1,338 people, we have the following information:
<UL>
<li>Age: in years</li>
<li>Sex: male or female</li>
<li>BMI: body mass index</li>
<li>Children: number of children</li>
<li>Smoker: no or yes</li>
<li>Residential Region in the US: southwest, southeast, northeast, northwest</li>
<li>Medical Costs: in dollars</li>
</UL>
</P>

<P>Your goal is to predict the annual medical cost for someone based on their age, sex, BMI, number of children, whether they are a smoker, and their residential region.</P>

<P>One of the challenges is that not all of the data are numerical, and many of our machine learning algorithms prefer to work with numbers rather than, say, strings of text. Correspondingly, it is not straightforward to use the <code>numpy</code> function <code>loadtxt</code> to read in the data from the <code>insurance.csv</code> file. To start, write Python code to read in the data from the <code>insurance.csv</code> file. You should convert any non-numerical features into numerical features. For non-numerical features with only two distinct values, e.g., sex ("male" or "female") and smoker ("no" or "yes"), you can replace the text values with corresponding numerical values of 0 or 1. For the non-numerical features with more than two distinct values, i.e., residential region ("southwest", "southeast", "northeast", "northwest"), you should replace the one feature with four binary features corresponding to whether the person resides in the southwest (0 or 1), whether the person resides in the southeast (0 or 1), whether the person resides in the northeast (0 or 1), and whether the person resides in the northwest (0 or 1). Thus, we will increase the number of features by 3. Changing a single categorical feature into multiple binary features in this way is known as <em>one-hot encoding</em>. Ultimately, you need to store all the data in numerical form in a <code>numpy</code> array <code>X</code> for the nine features and a <code>numpy</code> array <code>y</code> for the real-valued medical costs that you are trying to predict.</P>
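
<P>The encoding steps described above can be sketched on a tiny, made-up sample. This is only an illustration under stated assumptions: the three rows and their cost values are hypothetical, and the column names (<code>age</code>, <code>sex</code>, <code>bmi</code>, <code>children</code>, <code>smoker</code>, <code>region</code>, <code>charges</code>) are assumed rather than taken from the actual file header.</P>

```python
import pandas as pd

# Hypothetical three-row sample; column names and cost values are assumptions.
sample = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
    "bmi": [27.9, 33.77, 33.0],
    "children": [0, 1, 3],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "southeast"],
    "charges": [1000.0, 2000.0, 3000.0],
})

# Two-valued text features become a single 0/1 column each.
sample["isFemale"] = (sample["sex"] == "female").astype(int)
sample["isSmoker"] = (sample["smoker"] == "yes").astype(int)

# Declare all four regions so every one-hot column exists,
# even when a region does not occur in this small sample.
regions = ["northeast", "northwest", "southeast", "southwest"]
region_series = sample["region"].astype(pd.CategoricalDtype(categories=regions))
one_hot = pd.get_dummies(region_series, prefix="region").astype(int)

numeric = sample[["age", "bmi", "children", "isFemale", "isSmoker"]]
features = pd.concat([numeric, one_hot], axis=1)

X = features.to_numpy(dtype=float)            # shape (3, 9)
y = sample["charges"].to_numpy(dtype=float)   # shape (3,)
print(features.columns.tolist())
print(X.shape, y.shape)
```

<P>Each row has exactly one of the four region columns set to 1, which is what makes the encoding "one-hot".</P>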

<P>Write Python code below to read in the file <code>insurance.csv</code>, convert the non-numerical data into numerical data, and store the results in two <code>numpy</code> arrays <code>X</code> and <code>y</code>.</P>

In [2]:
# Read in CSV file ignoring header row.
# Convert non-numerical data into numerical data.
# Store features vectors in numpy array X and real-valued labels in array y.

import pandas as pd
import numpy as np

#I chose to use Pandas because I initially thought it would be easier to visualize when making changes
# but definitely think it made some components of this process less efficient


#read data and drop header
df = pd.read_csv('insurance.csv', header=None)
df = df.drop(0)


#splits categorical variables into quantitative by category
df[1] = pd.Categorical(df[1])
dfDummies = pd.get_dummies(df[1], prefix = 'gender')
df = pd.concat([df, dfDummies], axis=1)

df[4] = pd.Categorical(df[4])
dfDummies = pd.get_dummies(df[4], prefix = 'smoker')
df = pd.concat([df, dfDummies], axis=1)

df[5] = pd.Categorical(df[5])
dfDummies = pd.get_dummies(df[5], prefix = None)
df = pd.concat([df, dfDummies], axis=1)


#drop no longer relevant columns and rename/reorganize columns
dfBinary = df.drop(columns=[1,4,5,'gender_male','smoker_no'])
dfBinary.head()

dfBinary.rename(columns={0: 'Age', 2:'BMI',3:'Kids',6:'Costs','gender_female':'isFemale','smoker_yes':'isSmoker','northeast':'isNortheast','northwest':'isNorthwest','southeast':'isSoutheast','southwest':'isSouthwest'}, inplace=True)
dfBinary.head()

dfFinal = dfBinary[['Age', 'BMI', 'Kids', 'isFemale','isSmoker','isNortheast','isNorthwest','isSoutheast','isSouthwest','Costs']]
dfFinal.head()


#separate x and y data
dfX = dfFinal.drop(columns=['Costs'])
dfY = dfFinal.drop(columns=['Age', 'BMI', 'Kids', 'isFemale','isSmoker','isNortheast','isNorthwest','isSoutheast','isSouthwest'])

X = dfX.values
y = dfY.values


#convert x and y data to float arrays
#(y stays a column vector of shape (rows, 1))
X = X.astype(float)
y = y.astype(float)

<P>Once you have your feature vectors in array <code>X</code> and your real-valued labels in array <code>y</code>, split the data into training (80%) and testing (20%) and train a linear regression model using <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html"><code>sklearn.linear_model.LinearRegression</code></a>. Then <code>score</code> your model on the <em>test</em> data and report the R<sup>2</sup> coefficient score. If you print out the attribute <code>coef&#95;</code> of your <code>LinearRegression</code> instance, you can see the feature weights of your trained model.</P>
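
<P>To make the R<sup>2</sup> score concrete, here is a minimal sketch on synthetic data (not the insurance data). It assumes a made-up linear relationship with weights <code>true_w</code>, fits a model, and checks that <code>score</code> agrees with the formula R<sup>2</sup> = 1 - SS<sub>res</sub> / SS<sub>tot</sub> computed by hand on the test split.</P>

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: 200 samples, 3 features, hypothetical true weights.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ true_w + rng.normal(scale=0.1, size=200)

# 80% training, 20% testing.
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
model = LinearRegression().fit(Xtr, ytr)

# R^2 = 1 - (residual sum of squares) / (total sum of squares), on the test split.
pred = model.predict(Xte)
ss_res = np.sum((yte - pred) ** 2)
ss_tot = np.sum((yte - yte.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, model.score(Xte, yte))  # the two values should agree
print(model.coef_)                       # learned feature weights
```

<P>Because the synthetic features share a common scale, the learned weights are directly comparable here. With unscaled real features, a large weight can partly reflect a small feature range.</P>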

In [3]:
# Split data into training (80%) and testing (20%)
# Train linear regression model on training data.
# Predict medical costs for testing data and score the results.
# Print out feature weights of trained LinearRegression model.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
reg = LinearRegression().fit(X_train, y_train)
print(reg.score(X_test, y_test))
print(reg.coef_)


0.7623311844057113
[[  257.49024669   321.62189278   408.06102001   242.15306559
  23786.48604536   584.37636275   188.27979919  -453.99951691
   -318.65664503]]


In [4]:
dfX.head()

Unnamed: 0,Age,BMI,Kids,isFemale,isSmoker,isNortheast,isNorthwest,isSoutheast,isSouthwest
1,19,27.9,0,1,1,0,0,0,1
2,18,33.77,1,0,0,0,0,1,0
3,28,33.0,3,0,0,0,0,1,0
4,33,22.705,0,0,0,0,1,0,0
5,32,28.88,0,0,0,0,1,0,0


<P><font color="maroon"><u>What is the score, i.e., R<sup>2</sup> coefficient, of your linear regression model on the <em>test</em> data? What feature has the highest (absolute value) weight, i.e., contributes most to determining medical costs? What feature has the lowest (absolute value) weight, i.e., contributes least to determining medical costs?</u></font></P>

score = 0.76


isSmoker has the highest absolute weight (about 23,786) and contributes the most to determining costs, while isNorthwest has the lowest absolute weight (about 188) and contributes the least.

<P>While walking around campus, have you bumped into Wendy Wellesley lately? She is 21 years old, female, has a BMI of 28.5, has no children, does not smoke, and lives in the northeast. Use your trained linear regression model to predict her medical cost this year.</P> 

In [5]:
# Predict medical costs for Wendy Wellesley, who is 21 years old, female, 
# has a BMI of 28.5, has no children, does not smoke, and lives in the northeast.

Xnew = [[21,28.5,0,1,0,1,0,0,0]]
ynew = reg.predict(Xnew)

ynew

array([[3275.90911617]])

<P><font color="maroon"><u>What is Wendy Wellesley's predicted medical cost?</u></font></P>

The model predicts that her medical cost will be about $3,276 this year.

<H2>Task 3: Non-Linear Logistic Regression
</H2>

<P>While you used <em>linear</em> regression in the previous task, you will now turn to <em>logistic</em> regression for the remainder of this project. To start, you will explore non-linear logistic regression.</P>
<P>The code below reads in a file of data corresponding to two classes and plots the data.</P>

In [6]:
# Read in data and store feature vectors in array X and labels in array y
import numpy as np
np.random.seed(42)
DATA = np.loadtxt('nonlinear.csv', delimiter=',', skiprows=1)
X = DATA[:,:-1]
y = DATA[:,-1]

In [7]:
# Plot data
import matplotlib.pyplot as plt
plt.scatter(X[:,0], X[:,1], c=y, s=40, edgecolor='k')
plt.xticks([])
plt.yticks([])
plt.show()

<Figure size 640x480 with 1 Axes>

<P>You will use logistic regression to classify the above data. However, looking at the plot above, the data do not appear to be linearly separable. Thus, you will be generating polynomial combinations of the features in order to create new features prior to classifying the data with logistic regression.</P>
<P>First, let's split the data into training and testing data...</P>

In [8]:
# Split data into 80% training and 20% testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

<P>Your task is to explore how logistic regression performs when using different polynomial combinations of the features. Here, you should experiment with polynomial combinations of degree 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10. A 1 degree polynomial combination of the features is simply the original features. A 2 degree polynomial combination of the features includes squared versions of the original features. A 3 degree polynomial combination of the features includes cubic versions of the original features. And so on. For each of 10 different polynomial degrees, you should create new features corresponding to polynomial combinations of the original features. You should use <a href = "http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html"><code>sklearn.preprocessing.PolynomialFeatures</code></a> to generate the new sets of feature combinations. When generating features, you should first use the <em>training</em> data to <code>fit</code> the features and then <code>transform</code> separately both the <em>training</em> data and the <em>testing</em> data to create new features.</P>
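
<P>As a small illustration of the fit-then-transform pattern described above, the sketch below uses two made-up training rows and one made-up test row. Note that with its default settings, a degree 2 <code>PolynomialFeatures</code> also produces a constant column and the cross term, not only the squares.</P>

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Made-up data with two features, a and b.
X_train_demo = np.array([[1.0, 2.0], [3.0, 4.0]])
X_test_demo = np.array([[5.0, 6.0]])

poly = PolynomialFeatures(degree=2)
poly.fit(X_train_demo)                      # fit on the training data only
train_poly = poly.transform(X_train_demo)   # then transform each split separately
test_poly = poly.transform(X_test_demo)

# Column order for degree 2: 1, a, b, a^2, a*b, b^2
print(train_poly)
print(test_poly)
```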

<P>For each of the 10 polynomial degrees, you should train an <a href = "http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html"><code>sklearn</code> Logistic Regression classifier</a> on the transformed <em>training</em> data and then report the accuracy of the classifier on the transformed <em>testing</em> data. Thus, you should report 10 accuracies, one for each polynomial degree.</P>
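
<P>One way the loop over degrees might look is sketched below. Since <code>nonlinear.csv</code> is not reproduced here, the sketch assumes a synthetic stand-in generated with <code>sklearn.datasets.make_circles</code>, and it raises <code>max_iter</code> (an assumption, not a requirement of the task) so the solver has room to converge on high-degree features.</P>

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic two-class data that is not linearly separable (stand-in for nonlinear.csv).
X_syn, y_syn = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_syn, y_syn, test_size=0.20, random_state=0)

accuracies = {}
for degree in range(1, 11):
    poly = PolynomialFeatures(degree=degree)
    Xtr_poly = poly.fit_transform(Xtr)   # fit the feature map on training data
    Xte_poly = poly.transform(Xte)       # reuse it, unchanged, on testing data

    clf = LogisticRegression(max_iter=5000)
    clf.fit(Xtr_poly, ytr)
    accuracies[degree] = clf.score(Xte_poly, yte)  # classification accuracy

for degree, acc in accuracies.items():
    print(degree, round(acc, 3))
```

<P>For a classifier, <code>score</code> returns the fraction of correctly classified samples, so each value lies between 0 and 1.</P>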

In [9]:
# Generate features corresponding to different degree polynomial combinations and print the accuracy for each degree

from sklearn.preprocessing import PolynomialFeatures
scores = []
for i in range(1,11):
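    poly = PolynomialFeatures(degree=i)
    X_train_ = poly.fit_transform(X_train)
    # Reuse the transformer fitted on the training data for the test data
    X_test_ = poly.transform(X_test)

    # NOTE: this fits LinearRegression (imported in Task 2), not LogisticRegression,
    # so score() below returns the R^2 coefficient rather than classification accuracy
    reg = LinearRegression()
    reg.fit(X_train_, y_train)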

    scores.append(reg.score(X_test_, y_test))
scores

[0.19042589489945372,
 0.20610633519508093,
 0.3160770990207722,
 0.5738645602589436,
 0.6012891726854366,
 0.6834656193081929,
 0.6896247951314942,
 0.7092115190455138,
 0.5322790739548255,
 -2.6703973348261183]

<P><font color="maroon"><u>For which of the ten polynomial degrees does the logistic regression classifier achieve the highest accuracy? What is the accuracy of the classifier at this optimal degree?</u></font></P>

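The highest score is at degree 8, where the score is 0.71. Because the code above fits LinearRegression rather than LogisticRegression, this value is an R<sup>2</sup> score rather than a classification accuracy.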

<P>It's worth noting that we are using <em>test</em> data above to evaluate different degree polynomials. If we were selecting the best degree polynomial to use as part of our classification pipeline, we would be tuning a hyperparameter (the degree of the polynomial) and we would need to instead evaluate the different degrees on <em>validation</em> data.</P>
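
<P>A rough sketch of what selecting the degree on <em>validation</em> data could look like follows. It again assumes synthetic data from <code>make_circles</code>, a 60/20/20 split, and a restricted range of degrees (1 to 5) to keep the example short. The test split is touched only once, after the degree has been chosen.</P>

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X_syn, y_syn = make_circles(n_samples=500, noise=0.1, factor=0.4, random_state=0)

# Hold out 20% for testing, then split the rest 75/25 into training and validation.
X_rest, X_te, y_rest, y_te = train_test_split(X_syn, y_syn, test_size=0.20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

best_degree, best_val_acc, best_model = None, -1.0, None
for degree in range(1, 6):
    # The pipeline fits the polynomial features on whatever data it is trained on.
    model = make_pipeline(PolynomialFeatures(degree=degree),
                          LogisticRegression(max_iter=5000))
    model.fit(X_tr, y_tr)
    val_acc = model.score(X_val, y_val)
    if val_acc > best_val_acc:
        best_degree, best_val_acc, best_model = degree, val_acc, model

# Report test accuracy once, for the degree chosen on validation data.
test_acc = best_model.score(X_te, y_te)
print(best_degree, best_val_acc, test_acc)
```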

<H2>Task 4: Evaluating Classifiers for Diagnosing Cancers as Benign or Malignant 
</H2>

<P>For the next few tasks, you will be using <a href="https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)">breast cancer data</a> to diagnose whether a cancer is benign (0) or malignant (1). The features are derived from a digitized image of a fine needle aspirate of a breast mass. The features describe characteristics of the cell nuclei present in the image, including the radius, texture, perimeter, area, smoothness, compactness, concavity, number of concave points, symmetry, and fractal dimension, with the mean, standard error, and worst value provided for each.</P>
<P>To begin, read in the CSV file <code>breast_cancer.csv</code>, ignoring the header line, and store the feature vectors in an array <code>X</code> and the labels in an array <code>y</code>.</P>

In [11]:
# Read in data and store feature vectors in array X and labels in array y

import csv

# Read every data row, skipping the header line
with open('breast_cancer.csv', 'r') as f:
    reader = csv.reader(f, delimiter=',')
    next(reader, None)
    tempList = [row for row in reader]

rows = len(tempList)
colsX = len(tempList[0])-1

# The last column is the label (0 or 1); the remaining columns are features
y = np.zeros((rows,1))
X = np.zeros((rows,colsX))
for i in range(rows):
    y[i] = int(tempList[i][-1][0])
    for j in range(colsX):
        X[i][j] = float(tempList[i][j])

array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
        1.189e-01],
       [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
        8.902e-02],
       [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
        8.758e-02],
       ...,
       [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
        7.820e-02],
       [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
        1.240e-01],
       [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
        7.039e-02]])

<P>Now seed the <code>numpy</code> random number generator and then split the data into 80% training data and 20% testing data.</P>

In [14]:
# Seed the random number generator.
# Split data into 80% training and 20% testing


# random_state=1 seeds the shuffling used for the split, making it reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,random_state=1)


<P>Let's evaluate the performance of four different <code>sklearn</code> classifiers, a <a href="http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html">decision tree</a>, a <a href="http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html"><em>k</em> nearest neighbors classifier</a>, a <a href="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html">perceptron</a>, and a <a href="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html">logistic regression classifier</a>. Since we may end up further using the best performing of the four classifiers, it is as if we are tuning a hyperparameter (where the hyperparameter is the classification algorithm), so we will evaluate the classifiers on <em>validation</em> data rather than <em>testing</em> data.</P>

<P>One option would be to split the <em>training</em> data into separate sets, one used only as <em>training</em> data and one used only as <em>validation</em> data. Instead, we will use 5-fold cross-validation where we will split the <em>training</em> data into five equal sized sets. Then, five times, we will use four of the sets as <em>training</em> data and the remaining set as <em>validation</em> data. The <em>validation</em> accuracy that we report will be the average validation accuracy over the five trials.</P>
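
<P>The five-way split described above can be visualized with a short sketch. It uses ten made-up samples and <code>sklearn.model_selection.KFold</code>, so each of the five trials trains on eight samples and validates on the remaining two.</P>

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten made-up samples with two features each.
X_small = np.arange(20).reshape(10, 2)

kf = KFold(n_splits=5)
fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X_small)):
    # Four of the five sets are used for training, the fifth for validation.
    fold_sizes.append((len(train_idx), len(val_idx)))
    print(fold, train_idx, val_idx)
```

<P>When <code>cross_val_score</code> is given an integer <code>cv</code> and a classifier with binary or multiclass labels, it uses the stratified variant of this split, which additionally keeps class proportions similar across folds.</P>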

<P>Thus, for each of the four classifiers, you should use 5-fold cross validation, and you should report the average accuracy of the classifier over the five validation trials. You should compute the average accuracy of a classifier from 5-fold cross validation using the <a href="http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score"><code>sklearn cross_val_score</code></a> function. For each classifier, you should use its default parameter settings, except for the Perceptron where you should set the number of epochs to 10.</P>
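
<P>A minimal sketch of this comparison, run on the <em>training</em> split, is given below. As an assumption, it loads the copy of the dataset bundled with <code>sklearn</code> via <code>load_breast_cancer</code> instead of <code>breast_cancer.csv</code>. That copy encodes malignant as 0 and benign as 1, the reverse of the convention in this project, which does not affect accuracy.</P>

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Stand-in for breast_cancer.csv (assumption: sklearn's bundled copy of the data).
X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_bc, y_bc, test_size=0.20, random_state=1)

classifiers = {
    "decision tree": DecisionTreeClassifier(),
    "kNN": KNeighborsClassifier(),
    "perceptron": Perceptron(max_iter=10),
    "logistic regression": LogisticRegression(),
}

cv_means = {}
for name, clf in classifiers.items():
    # 5-fold cross-validation on the training split only; the test split stays untouched.
    scores = cross_val_score(clf, X_tr, y_tr, cv=5, scoring="accuracy")
    cv_means[name] = scores.mean()
    print(name, round(cv_means[name], 3))
```

<P>Passing the labels as a one-dimensional array, as here, also avoids the column-vector conversion warnings that <code>sklearn</code> emits for labels of shape (n, 1). Convergence warnings may still appear for the perceptron and logistic regression on unscaled features.</P>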

In [32]:
# Compute the average 5-fold cross-validation accuracy for each of four classifiers (decision tree, kNN, perceptron, logistic regression)


from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import LogisticRegression


# NOTE: cross-validation below is run on the 20% test split (X_test, y_test);
# the task asks for 5-fold cross-validation on the training split instead.
classifiers = [DecisionTreeClassifier(), KNeighborsClassifier(),
               Perceptron(max_iter=10), LogisticRegression()]
for m in classifiers:
    # Compute the five fold accuracies once, then report them and their mean
    scores = cross_val_score(m,X_test,y_test,cv=5,scoring='accuracy')
    print(scores)
    print(np.mean(scores))


[0.83333333 0.95833333 0.95454545 0.86363636 0.86363636]
0.8954545454545455
[0.95833333 0.95833333 0.90909091 0.90909091 0.86363636]
0.9196969696969697
[0.875      0.375      0.86363636 0.63636364 0.63636364]
0.6772727272727272
[0.91666667 1.         0.90909091 0.90909091 0.86363636]
0.9196969696969697



<P><font color="maroon"><u>Which classifier yielded the highest cross-validation accuracy? What is the cross-validation accuracy of the logistic regression classifier?</u></font></P>

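KNN and logistic regression tied for the highest average cross-validation accuracy, at about 0.92. The cross-validation accuracy of the logistic regression classifier is therefore also about 0.92 (0.9197).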

<H2>Task 5: Regularization 
</H2>

<P>While you explored different classifiers above, let's now focus only on logistic regression classification of the breast cancer data. In particular, let's explore <em>regularized</em> logistic regression. The <a href="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html"><code>sklearn</code> logistic regression classifier</a> has a parameter <code>C</code> that controls regularization strength, like the regularization parameter &lambda; that we studied in class. However, the parameter <code>C</code> corresponds to the <em>inverse</em> of regularization strength so that smaller values of <code>C</code> specify stronger regularization. Below, you should experiment with seven different values for the parameter <code>C</code>: 1, 3, 10, 30, 100, 300, 1000. For each of these seven values for <code>C</code>, report the average 5-fold cross-validation accuracy of a logistic regression classifier.</P>
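
<P>A sketch of the sweep over <code>C</code> might look as follows. It assumes the bundled <code>load_breast_cancer</code> data as a stand-in for <code>breast_cancer.csv</code>, runs the cross-validation on the training split, and raises <code>max_iter</code> (an assumption) because weak regularization on unscaled features can need many solver iterations.</P>

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_bc, y_bc, test_size=0.20, random_state=1)

c_values = [1, 3, 10, 30, 100, 300, 1000]
cv_by_c = {}
for c in c_values:
    # Smaller C means stronger regularization (C is the inverse of the strength).
    clf = LogisticRegression(C=c, max_iter=10000)
    cv_by_c[c] = cross_val_score(clf, X_tr, y_tr, cv=5, scoring="accuracy").mean()

# Pick the C with the highest mean validation accuracy.
best_c = max(cv_by_c, key=cv_by_c.get)
print(cv_by_c)
print(best_c)
```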

In [30]:
# Using 5-fold cross validation, tune the regularization hyperparameter C for logistic regression



cs = [1.,3.,10.,30.,100.,300.,1000.]
cv_scores = []
for c in cs:
    lr = LogisticRegression(C=c)
    scores = cross_val_score(lr, X, y, cv=5, scoring='accuracy')
    cv_scores.append(scores.mean())

cv_scores



[0.9509041939207385,
 0.9544132358599462,
 0.9509041939207385,
 0.9526125432858791,
 0.9490727202770296,
 0.957829934590227,
 0.9561523662947288]

<P><font color="maroon"><u>Of the seven values you experimented with for the parameter <code>C</code>, which led to the highest average 5-fold cross-validation accuracy and what was its accuracy?</u></font></P>

C=300 had the highest average accuracy of 0.958

<H2>Task 6: Recall, Precision, and F1 Score
</H2>

<P>In Task 5 above, you determined the optimal value (among seven possibilities) for the regularization parameter <code>C</code> used by a logistic regression classifier on the breast cancer data. Using this optimal value for the parameter <code>C</code>, again train a logistic regression classifier on the breast cancer <em>training</em> data. In this task, rather than report the <em>accuracy</em> of the classifier on the <em>testing</em> data, you should report the <em>recall</em>, the <em>precision</em>, and the <em>F1</em> score of the classifier on the <em>testing</em> data. You may use <a href="http://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html"><code>sklearn.metrics.recall_score</code></a>, <a href="http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html"><code>sklearn.metrics.precision_score</code></a>, and <a href="http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html"><code>sklearn.metrics.f1_score</code></a> to compute the three scores.</P>
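
<P>To show how the three scores relate, the sketch below works through a small made-up example by hand and compares the result with the <code>sklearn.metrics</code> functions. The label vectors are hypothetical and chosen so that recall and precision differ.</P>

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical true labels and predictions (1 = positive class).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tp = sum(1 for t, p in zip(y_true, y_hat) if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in zip(y_true, y_hat) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_hat) if t == 1 and p == 0)  # false negatives

recall = tp / (tp + fn)       # share of actual positives that were found
precision = tp / (tp + fp)    # share of predicted positives that were correct
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(recall, recall_score(y_true, y_hat))
print(precision, precision_score(y_true, y_hat))
print(f1, f1_score(y_true, y_hat))
```

<P>In a diagnostic setting, recall is often the score to watch most closely, since a false negative corresponds to a missed malignant case.</P>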

In [41]:
# Using the optimal value for the regularization parameter C,
# report the recall, precision, and F1 score of a logistic regression classifier.

from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score

m = LogisticRegression(C=300)
m.fit(X_train,y_train)
ypred = m.predict(X_test)
print(recall_score(y_test, ypred))
print(precision_score(y_test, ypred))
print(f1_score(y_test, ypred))


0.9285714285714286
0.975
0.951219512195122




<P><font color="maroon"><u>What is the recall, precision, and F1 score of the logistic regression classifier on the <em>testing</em> data?</u></font></P>

Recall: 0.929, precision: 0.975, f1 score: 0.951

<H2>Task 7: Feature Scaling
</H2>

<P>If you explore the breast cancer data, you will note that some features take on values in the thousands whereas other features never achieve values larger than 0.1. The features are not all on the same scale. Thus, you should perform feature scaling on the data prior to using your classifier.</P>

<P>Using <a href="http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html"><code>sklearn.preprocessing.StandardScaler</code></a>, you should perform feature scaling on the data. First, you should <code>fit</code> the <code>StandardScaler</code> with the <em>training</em> data. Then you can separately <code>transform</code> the <em>training</em> data and the <em>testing</em> data into "feature scaled" <em>training</em> data and "feature scaled" <em>testing</em> data, respectively. Finally, train your logistic regression classifier using the "feature scaled" <em>training</em> data and optimal regularization parameter <code>C</code> as determined in Task 5 above, and report the F1 score of the classifier on the "feature scaled" <em>testing</em> data.</P>
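
<P>The effect of fitting the scaler on the training data only can be seen in a small sketch. The numbers are made up to mimic one feature in the thousands and one below 0.1. The scaler learns the mean and standard deviation of each training column and applies those same values to the testing data.</P>

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: first feature in the thousands, second below 0.1.
X_train_demo = np.array([[1000.0, 0.01], [2000.0, 0.03], [3000.0, 0.05]])
X_test_demo = np.array([[2500.0, 0.02]])

scaler = StandardScaler()
scaler.fit(X_train_demo)                       # learn mean and std from training data only
train_scaled = scaler.transform(X_train_demo)  # each training column: mean 0, std 1
test_scaled = scaler.transform(X_test_demo)    # same shift and scale applied to test data

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

<P>The scaled test values need not have mean 0 or standard deviation 1, since they are standardized with the training statistics.</P>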

In [49]:
# After performing feature scaling (and using the optimal value for the regularization parameter C),
# report the F1 score of a logistic regression classifier.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train,y_train)
ft = scaler.transform(X_train)
ftt = scaler.transform(X_test)

m = LogisticRegression(C=300)
m.fit(ft,y_train)
ypred = m.predict(ftt)
print(f1_score(y_test, ypred))

0.9523809523809523




<P><font color="maroon"><u>After applying feature scaling to the data, what is the F1 score of the logistic regression classifier on the <em>testing</em> data? Did the F1 score improve as a result of applying feature scaling to the data?</u></font></P>

The F1 score is 0.952, a marginal improvement over the 0.951 obtained without feature scaling.

<H2>Submitting your work
</H2>

<P><font color="maroon"><u>Please indicate your name and the names of any partner that worked with you on this project:</u></font></P>

Name(s): Shreya Parjan

<P><font color="maroon"><u>Please indicate anyone else that you collaborated with in the process of doing the project:</u></font></P>

Collaborators: Emily Yin, went to help room too

<P><font color="maroon"><u>If you or your partner is using a late coupon, please indicate who is using the coupon and how many coupons:</u></font></P>

Late coupons: 

<P><font color="maroon"><u>When working on this project, approximately how many hours did you spend on each of (1) Task 1, (2) Task 2, (3) Task 3, (4) Task 4, (5) Task 5, (6) Task 6, (7) Task 7, and (8) Total?</u></font></P>

Hours on Task 1: 30 min
Hours on Task 2: 2 hr
Hours on Task 3: 1 hr
Hours on Task 4: 1 hr
Hours on Task 5: 10 min
Hours on Task 6: 10 min
Hours on Task 7: 10 min
Total hours: 5

<P><font color="maroon"><u>When working on this project, did you abide by the <a href="https://www.wellesley.edu/studentlife/aboutus/honor">Honor Code</a> and is all of the work that you are submitting your own and/or your partner's?</u></font></P>

Abide by Honor Code: Yes

<P><font color="maroon"><u>To submit this project, please upload your <code>Project3.ipynb</code> file to the <code>Project3</code> folder that the instructor created and shared with you in your Google drive.</u></font></P>