# Programming Assignment #4


## 1. Linear Regression using scikit-learn

The diamonds dataset contains the price, cut, color, and other characteristics of a sample of nearly 54,000 diamonds. This data can be used to predict the price of a diamond based on its characteristics. Use sklearn's LinearRegression() function to predict the price of a diamond from the diamond's carat and table values.

- Import needed packages for regression.
- Initialize and fit a multiple linear regression model.
- Get the estimated intercept weight.
- Get the estimated weights of the carat and table features.
- Predict the price of a diamond with the user-input carat and table values.

Ex: If the input is:

- 0.5
- 60

the output should be:

- Intercept is 1961.992
- Weights for carat and table features are [7820.038  -74.301]
- Predicted price is [1413.97]

In [3]:
# Import needed packages for regression
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Silence warning from sklearn
import warnings
warnings.filterwarnings('ignore')

# Input feature values for a sample instance
carat = float(input())
table = float(input())

diamonds = pd.read_csv('diamonds.csv')

# Define input and output features
X = diamonds[['carat', 'table']]
y = diamonds['price']

# Initialize a multiple linear regression model
model = LinearRegression()

# Fit the multiple linear regression model to the input and output features
model.fit(X, y)

# Get estimated intercept weight
intercept = model.intercept_
print('Intercept is', round(intercept, 3))

# Get estimated weights for carat and table features
coefficients = model.coef_
print('Weights for carat and table features are', np.round(coefficients, 3))

# Predict the price of a diamond with the user-input carat and table values
prediction = model.predict([[carat, table]])
print('Predicted price is', np.round(prediction, 2))

 0.5
 60


Intercept is 1961.992
Weights for carat and table features are [7820.038  -74.301]
Predicted price is [1413.97]
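
To see where the predicted price comes from, the fitted model can be reproduced by hand: a linear regression prediction is the intercept plus the weighted sum of the feature values. The sketch below plugs in the rounded values printed above; `predict_price` is a hypothetical helper, not part of scikit-learn. Because the weights are rounded to three decimals, the result (about 1413.95) lands near the model's 1413.97 rather than exactly on it.

```python
# Rounded values copied from the cell output above
intercept = 1961.992
weights = [7820.038, -74.301]  # carat, table


def predict_price(carat, table):
    """Hypothetical helper: intercept + w_carat * carat + w_table * table."""
    return intercept + weights[0] * carat + weights[1] * table


manual = predict_price(0.5, 60)
print(round(manual, 2))
```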


## 2. Logistic Regression using scikit-learn

The **nbaallelo_log** file contains data on 126314 NBA games from 1947 to 2015. The dataset includes the features **pts, elo_i, win_equiv, and game_result**. Using the csv file **nbaallelo_log.csv** and scikit-learn's **LogisticRegression()** function, construct a logistic regression model to classify whether a team will win or lose a game based on the team's elo_i score.

- Create a binary feature win for **game_result** with 0 for L and 1 for W
- Use the **LogisticRegression()** function to construct a logistic regression model with **win** as the target and **elo_i** as the predictor
- Print the weights and intercept of the fitted model
- Find the proportion of instances correctly classified
  
Note: Use **ravel()** from **numpy** to flatten the second argument of **LogisticRegression.fit()** into a 1-D array.
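
As a small illustration of the note above, selecting a single column with double brackets yields a two-dimensional array with one column, and `ravel()` turns it into the one-dimensional form expected for the target. The array values here are made up.

```python
import numpy as np

# A single-column selection such as NBA[['win']].values has shape (n, 1)
column = np.array([[1], [0], [1], [1]])

# ravel() flattens it to shape (n,)
flat = column.ravel()

print(column.shape, flat.shape)
```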

Ex: If the program uses the file **nbaallelo_small.csv**, which contains 100 instances, the output is:

- w1: [[3.64194406e-06]]
- w0: [-2.80257471e-09]
- 0.5

In [31]:
# Import the necessary libraries
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Load nbaallelo_small.csv into a dataframe (use nbaallelo_log.csv for the full dataset)
NBA = pd.read_csv('nbaallelo_small.csv')

# Create binary feature for game_result with 0 for L and 1 for W
NBA['win'] = NBA['game_result'].apply(lambda x: 1 if x == 'W' else 0)

# Store relevant columns as variables
X = NBA[['elo_i']]
y = NBA[['win']].values.ravel()

# Initialize and fit the logistic model using the LogisticRegression() function
model = LogisticRegression()
model.fit(X, y)

# Print the weights for the fitted model
print('w1:', model.coef_)

# Print the intercept of the fitted model
print('w0:', model.intercept_)

# Find the proportion of instances correctly classified
score = model.score(X, y)
print(round(score, 3))

w1: [[0.01584847]]
w0: [-20.5904548]
0.62
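
The printed weights can be turned back into a win probability with the logistic function. The sketch below copies w1 and w0 from the output above; `win_probability` is a hypothetical helper, and the ratings 1200 and 1400 are arbitrary illustration values. The rating at which the model is indifferent is `-w0 / w1`, roughly 1299 for these weights.

```python
import math

# Weights copied from the cell output above
w1 = 0.01584847
w0 = -20.5904548


def win_probability(elo_i):
    """Logistic model: P(win) = 1 / (1 + exp(-(w0 + w1 * elo_i)))."""
    return 1.0 / (1.0 + math.exp(-(w0 + w1 * elo_i)))


# Rating at which P(win) = 0.5; teams rated above it are classified as winners
boundary = -w0 / w1

print(round(boundary, 1))
print(win_probability(1200) < 0.5 < win_probability(1400))
```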


## 3. Support Vector Classifier using scikit-learn

The heart dataset contains 13 health-related attributes from 303 patients and one attribute denoting whether or not the patient has heart disease. Using the file heart.csv and scikit-learn's LinearSVC() function, fit a support vector classifier to predict whether a patient has heart disease based on other health attributes.

- Import the correct packages and functions.
- Split the data into 75% training data and 25% testing data. Set random_state=123.
- Initialize and fit a support vector classifier with C=0.2, a maximum of 500 iterations, and random_state=123.
- Print the model weights.

Ex: If the program input is heart_small.csv, which contains 100 instances, the output is:

0.6

w0: [0.013]
w1 and w2: [[ 0.361 -0.087]]

In [35]:
# Import the necessary packages
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

heart = pd.read_csv('heart_small.csv')

# Input features: thalach and age
X = heart[['thalach', 'age']]

# Output feature: target
y = heart[['target']]

# Create training and testing data with 75% training data and 25% testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=123)

# Scale the input features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Initialize a support vector classifier with C=0.2 and a maximum of 500 iterations
SVC = LinearSVC(C=0.2, max_iter=500, random_state=123)

# Fit the support vector classifier according to the training data
SVC.fit(X_train, np.ravel(y_train))

# Evaluate model on testing data
score = SVC.score(X_test, np.ravel(y_test))
print(np.round(score, 3))

# Print the model weights
# w0
print('w0:', np.round(SVC.intercept_, 3))

# w1 and w2
print('w1 and w2:', np.round(SVC.coef_, 3))

0.6
w0: [0.013]
w1 and w2: [[ 0.361 -0.087]]
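
A linear support vector classifier predicts from the sign of a weighted sum of the scaled features. The following sketch reuses the weights printed above and assumes the target is coded 0/1 with 1 as the positive class. The helper names and the training mean and standard deviation values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
# Weights copied from the cell output above
w0 = 0.013
w = [0.361, -0.087]  # thalach, age (after scaling)


def standardize(value, mean, std):
    """Per-feature scaling: subtract the training mean, divide by the training std."""
    return (value - mean) / std


def decision(z_thalach, z_age):
    """Signed score; a positive score maps to the positive class."""
    return w0 + w[0] * z_thalach + w[1] * z_age


# Hypothetical training statistics, for illustration only
z_thalach = standardize(170.0, mean=150.0, std=20.0)  # 1.0
z_age = standardize(54.0, mean=54.0, std=9.0)         # 0.0

score = decision(z_thalach, z_age)
predicted = 1 if score > 0 else 0
print(round(score, 3), predicted)
```

Scaling with statistics taken from the training data only, as the cell above does, keeps information from the test set out of the fitted model.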


## 4. k-Nearest Neighbors using scikit-learn

The SDSS dataset contains 17 observational features and one class feature for 10000 deep sky objects observed by the Sloan Digital Sky Survey. Use sklearn's KNeighborsClassifier() function to classify each object by its redshift and u-g color.

- Import the necessary modules for kNN classification
- Create dataframe X with features redshift and u_g
- Create dataframe y with feature class
- Split the data into 70% training data and 30% test data
- Initialize a kNN model with k=3
- Fit the model using the training data
- Find the predicted classes for the test data
- Calculate the accuracy score using the test data

Ex: If the feature u is used rather than u_g, the output is:

- Accuracy score is 0.979

In [39]:
# Import needed packages for classification
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Import packages for evaluation
from sklearn.metrics import accuracy_score

# Load the dataset
skySurvey = pd.read_csv('SDSS.csv')

# Create a new feature from u - g
skySurvey['u_g'] = skySurvey['u'] - skySurvey['g']

# Create dataframe X with features redshift and u_g
X = skySurvey[['redshift', 'u_g']]

# Create dataframe y with feature class
y = skySurvey['class']

np.random.seed(42)

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Initialize model with k=3
skySurveyKnn = KNeighborsClassifier(n_neighbors=3)

# Fit model using X_train and y_train
skySurveyKnn.fit(X_train, y_train)

# Find the predicted classes for X_test
y_pred = skySurveyKnn.predict(X_test)

# Calculate accuracy score
score = accuracy_score(y_test, y_pred)

# Print accuracy score
print('Accuracy score is ', end="")
print('%.3f' % score)

Accuracy score is 0.984
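
For intuition about what the classifier does at prediction time, here is a minimal from-scratch sketch of kNN with k=3: measure the Euclidean distance from the query to every training point, keep the three closest, and take a majority vote. The function `knn_predict` and the toy (redshift, u_g) values are made up for illustration; scikit-learn's implementation uses faster neighbor searches.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points closest to query."""
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy (redshift, u_g) points with made-up values
train_points = [(0.0, 1.1), (0.0, 1.3), (0.1, 1.8), (0.12, 1.9), (1.5, 0.2), (1.8, 0.3)]
train_labels = ['STAR', 'STAR', 'GALAXY', 'GALAXY', 'QSO', 'QSO']

print(knn_predict(train_points, train_labels, (0.11, 1.85)))
print(knn_predict(train_points, train_labels, (1.6, 0.25)))
```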


## 5. Naive Bayes using scikit-learn 

The file SDSS contains 17 observational features and one class feature for 10000 deep sky objects observed by the Sloan Digital Sky Survey. Use sklearn's GaussianNB() function to perform Gaussian naive Bayes classification to classify each object by the object's redshift and u-g color.

- Import the necessary modules for Gaussian naive Bayes classification
- Create dataframe X with features redshift and u_g
- Create dataframe y with feature class
- Initialize a Gaussian naive Bayes model with the default parameters
- Fit the model
- Calculate the accuracy score

Note: Use ravel() from numpy to flatten the second argument of GaussianNB.fit() into a 1-D array.

Ex: If the feature u is used rather than u_g, the output is:

- Accuracy score is 0.987

In [41]:
# Import the necessary modules
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load the dataset
skySurvey = pd.read_csv('SDSS.csv')

# Create a new feature from u - g
skySurvey['u_g'] = skySurvey['u'] - skySurvey['g']

# Create dataframe X with features redshift and u_g
X = skySurvey[['redshift', 'u_g']]

# Create dataframe y with feature class
y = skySurvey['class']

# Initialize a Gaussian naive Bayes model
skySurveyNBModel = GaussianNB()

# Fit the model
skySurveyNBModel.fit(X, y)

# Calculate the proportion of instances correctly classified
score = skySurveyNBModel.score(X, np.ravel(y))

# Print accuracy score
print('Accuracy score is ', end="")
print('%.3f' % score)

Accuracy score is 0.987
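
Gaussian naive Bayes can be sketched in a few lines: estimate a prior and a per-feature mean and variance for each class, then pick the class with the largest log prior plus summed log Gaussian densities. The helpers and toy data below are hypothetical. Unlike scikit-learn's GaussianNB, this sketch applies no variance smoothing, so it assumes every feature varies within every class.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate a prior plus per-feature mean and variance for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        variances = [
            sum((v - m) ** 2 for v in col) / n
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(rows), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Pick the class with the largest log prior + summed log Gaussian densities."""
    def log_posterior(params):
        prior, means, variances = params
        total = math.log(prior)
        for x, m, var in zip(row, means, variances):
            total += -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
        return total
    return max(model, key=lambda label: log_posterior(model[label]))


# Toy (redshift, u_g) rows with made-up values
rows = [(0.0, 1.1), (0.01, 1.3), (0.1, 1.8), (0.12, 1.9), (1.5, 0.2), (1.8, 0.3)]
labels = ['STAR', 'STAR', 'GALAXY', 'GALAXY', 'QSO', 'QSO']

nb_model = fit_gaussian_nb(rows, labels)
print(predict_gaussian_nb(nb_model, (0.11, 1.85)))
print(predict_gaussian_nb(nb_model, (1.6, 0.25)))
```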


## 6. Ensemble methods using scikit-learn 

### 6.1. Bagging using scikit-learn

The msleep_clean dataset contains information on sleep habits for 47 mammals. Features include length of REM sleep, time spent awake, brain weight, and body weight.

- Create a dataframe X containing the features awake, brainwt, and bodywt, in that order.
- Create a dataframe y containing sleep_rem.
- Initialize and fit a bagging regressor with 30 base estimators, a random state of 10, and oob_score=True.

Ex: If 10 base estimators are used, the output should be:

0.2322

[3.26   2.92   1.0333 2.3333 0.8    1.325  2.56   2.2667 0.8    2.38
 3.     0.5333 3.175  2.9667 0.7    0.65   1.825  2.2667 2.     1.
 0.6    1.1667 1.5    3.1    2.     1.9    4.15   1.3    0.75   1.2
 2.025  1.45   3.0286 2.72   0.5    2.0333 1.12   2.     2.65   1.65
 2.6667 2.3    1.45   0.58   2.625  1.6    0.74   1.3   ]

In [45]:
import numpy as np
import pandas as pd
from sklearn.ensemble import BaggingRegressor

df = pd.read_csv('msleep_clean.csv')

# Create a dataframe X containing the features awake, brainwt, and bodywt, in that order
X = df[['awake', 'brainwt', 'bodywt']]

# Create a dataframe y containing sleep_rem
y = df['sleep_rem']


# Initialize and fit bagging regressor with 30 base estimators, a random state of 10, and oob_score=True
sleepModel = BaggingRegressor(n_estimators=30, random_state=10, oob_score=True)
sleepModel.fit(X, y)

# Print the out-of-bag score (R-squared for a regressor)
print(np.round(sleepModel.oob_score_, 4))

# Print the out-of-bag prediction for each training instance
print(np.round(sleepModel.oob_prediction_, 4))

0.3144
[3.1    2.98   0.8    1.6867 0.7167 1.8533 2.3091 2.0727 1.5231 2.1727
 2.95   0.5375 2.9417 2.8727 0.9    0.7765 1.9818 2.5    1.8692 1.57
 0.9692 1.1778 1.5769 2.75   2.4    2.1364 4.1727 1.6765 0.71   1.3909
 2.13   1.9733 3.1429 3.3533 0.51   2.0077 1.4    2.0143 2.45   1.975
 2.6    2.3    1.2    0.58   2.8222 1.9222 0.7417 1.24  ]
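
The out-of-bag values above exist because each base estimator is trained on a bootstrap sample that leaves some instances out. The following sketch shows that mechanism under simplifying assumptions: `oob_predictions` is a hypothetical helper, the target values are made up, and the base estimator is a stand-in that predicts the mean of its bootstrap sample, whereas BaggingRegressor uses decision trees by default.

```python
import random


def oob_predictions(y, n_estimators=30, seed=10):
    """Average, for each instance, the predictions of estimators that never saw it."""
    rng = random.Random(seed)
    n = len(y)
    sums = [0.0] * n
    counts = [0] * n
    for _ in range(n_estimators):
        # Bootstrap sample: draw n indices with replacement
        sample = [rng.randrange(n) for _ in range(n)]
        in_bag = set(sample)
        # Stand-in base estimator: predict the mean target of the bootstrap sample
        prediction = sum(y[i] for i in sample) / n
        for i in range(n):
            if i not in in_bag:
                sums[i] += prediction
                counts[i] += 1
    # An instance that landed in every bag has no out-of-bag prediction
    predictions = [s / c if c else None for s, c in zip(sums, counts)]
    return predictions, counts


# Made-up target values, for illustration only
y_toy = [3.1, 2.9, 0.8, 1.7, 0.7, 1.9, 2.3, 2.1]
oob_pred, oob_counts = oob_predictions(y_toy)
print(oob_counts)
```

With few estimators, some instances may never be out of bag, which is one reason out-of-bag estimates become more reliable as the number of base estimators grows.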


### 6.2. Random forests using scikit-learn

The mpg_clean.csv dataset contains information on miles per gallon (mpg) and engine size for cars sold from 1970 through 1982. Dataframe X contains the input features mpg, cylinders, displacement, horsepower, weight, acceleration, and model_year. Dataframe y contains the output feature origin.

- Initialize and fit a random forest classifier with a user-input number of decision trees (estimators), a user-input number of features considered at each split (max_features), and a random state of 123.
- Calculate the prediction accuracy for the model.
- Read the documentation for the permutation_importance function from scikit-learn's inspection module.
- Calculate the permutation importance using the default parameters and a random state of 123.

Ex: When the input is

5

3

the output is:

0.9796


| Index | Feature      | Permutation Importance |
|-------|--------------|------------------------|
| 2     | displacement | 0.453571               |
| 0     | mpg          | 0.160204               |
| 4     | weight       | 0.133673               |
| 3     | horsepower   | 0.107653               |
| 5     | acceleration | 0.057143               |
| 6     | model_year   | 0.051531               |
| 1     | cylinders    | 0.012245               |




In [49]:
import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

df = pd.read_csv('mpg_clean.csv')

# Create a dataframe X containing the input features
X = df.drop(columns=['name', 'origin'])
# Create a dataframe y containing the output feature origin
y = df[['origin']]

# Get user-input n_estimators and max_features (try different values)
estimators = int(input())
max_features = int(input())

# Initialize and fit a random forest classifier with user-input number of decision trees, 
# user-input number of features considered at each split, and a random state of 123
rfModel = RandomForestClassifier(n_estimators=estimators, max_features=max_features, random_state=123)
rfModel.fit(X, np.ravel(y))

# Calculate prediction accuracy
score = rfModel.score(X, y)
print(round(score, 4))

# Calculate the permutation importance using the default parameters and a random state of 123
result = permutation_importance(rfModel, X, y, random_state=123)

# Variable importance table
importance_table = pd.DataFrame(
    data={'feature': rfModel.feature_names_in_, 'permutation importance': result.importances_mean}
).sort_values('permutation importance', ascending=False)

print(importance_table)

 5
 3


0.9796
        feature  permutation importance
2  displacement                0.453571
0           mpg                0.160204
4        weight                0.133673
3    horsepower                0.107653
5  acceleration                0.057143
6    model_year                0.051531
1     cylinders                0.012245
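
Permutation importance measures how much a model's score drops when one feature column is shuffled, which breaks the link between that feature and the target. Below is a minimal sketch of the idea, not scikit-learn's implementation: the helper names, the toy data, and the rule-based stand-in model are all hypothetical. It averages over five shuffles per feature, matching the default number of repeats in `permutation_importance`.

```python
import random


def accuracy(predict, rows, labels):
    """Proportion of rows whose predicted label matches the true label."""
    return sum(predict(r) == t for r, t in zip(rows, labels)) / len(rows)


def permutation_importance_sketch(predict, rows, labels, n_repeats=5, seed=123):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data: the label depends only on feature 0; feature 1 is noise
rows = [(float(i), float(i % 7)) for i in range(100)]
labels = [int(r[0] >= 50) for r in rows]


def predict(row):
    """Stand-in model that only looks at feature 0."""
    return int(row[0] >= 50)


importances = permutation_importance_sketch(predict, rows, labels)
print(importances)
```

A feature the model ignores gets an importance of zero, while shuffling the feature the model relies on lowers accuracy noticeably.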


### 6.3. Boosting using scikit-learn

The mpg.csv dataset contains information on miles per gallon (mpg) and engine size for cars sold from 1970 through 1982.

- Create a dataframe X containing the input features cylinders, weight, and mpg.
- Create a dataframe y containing the output feature origin.
- Initialize and fit an adaptive boosting classifier with a user-input learning rate lr and a random state of 123.
- Initialize and fit a gradient boosting classifier with a user-input learning rate lr and a random state of 123.
- Calculate the prediction accuracy for each model.

Ex: If the user-input learning rate is 0.6, the output is:

0.7688

0.995

In [51]:
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier

mpg = pd.read_csv('mpg.csv')

# Create a dataframe X containing cylinders, weight, and mpg
X = mpg[['cylinders', 'weight', 'mpg']]

# Create a dataframe y containing origin
y = mpg[['origin']]

# Get user-input learning rate
lr = float(input())

# Initialize and fit an adaptive boosting classifier with the user-input learning rate and a 
# random state of 123
adaBoostModel = AdaBoostClassifier(learning_rate=lr, random_state=123)
adaBoostModel.fit(X, np.ravel(y))

# Initialize and fit a gradient boosting classifier with the user-input learning rate and a 
# random state of 123
gradientBoostModel = GradientBoostingClassifier(learning_rate=lr, random_state=123)
gradientBoostModel.fit(X, np.ravel(y))

# Calculate the prediction accuracy for the adaptive boosting classifier
adaBoostScore = adaBoostModel.score(X, y)
print(round(adaBoostScore, 4))

# Calculate the prediction accuracy for the gradient boosting classifier
gradientBoostScore = gradientBoostModel.score(X, y)
print(round(gradientBoostScore, 4))

 0.6


0.7688
0.995
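
To illustrate what the learning rate controls in adaptive boosting, the sketch below performs one textbook reweighting round for a two-class problem: the weak learner's weight is scaled by the learning rate, and misclassified instances gain weight before the next round. `adaboost_round` is a hypothetical helper and the data are made up. scikit-learn's AdaBoostClassifier also handles more than two classes, and its internals may differ by version.

```python
import math


def adaboost_round(sample_weights, misclassified, learning_rate):
    """One textbook AdaBoost reweighting step for a two-class problem."""
    total_weight = sum(sample_weights)
    # Weighted error of the current weak learner
    error = sum(w for w, miss in zip(sample_weights, misclassified) if miss) / total_weight
    # Learner weight, shrunk by the learning rate
    alpha = learning_rate * math.log((1 - error) / error)
    # Boost the weight of misclassified instances, then renormalize
    updated = [
        w * math.exp(alpha) if miss else w
        for w, miss in zip(sample_weights, misclassified)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Ten equally weighted instances, two of which the weak learner gets wrong
weights = [0.1] * 10
misclassified = [True, True] + [False] * 8

alpha, new_weights = adaboost_round(weights, misclassified, learning_rate=0.6)
print(round(alpha, 4))
```

A smaller learning rate shrinks `alpha`, so each round shifts less weight toward the misclassified instances and the ensemble adapts more gradually.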
