# Linear Regression

Linear regression analysis is used to predict the value of a variable based on the value of one or more other variables. The variable you want to predict is called the dependent variable. The variables you use to predict it are called independent variables.
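
As a minimal illustration of the idea, the sketch below fits a straight line to a handful of made-up points with one independent variable. The arrays `x` and `y` and the helper `predict` are hypothetical and not part of the assignment data; they only show how a slope and intercept turn a value of the independent variable into a prediction of the dependent variable.

```python
import numpy as np

# Hypothetical toy data: x is the independent variable, y the dependent one.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])  # constructed so that y = 2x + 1

# Least-squares slope and intercept for a single predictor.
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

def predict(x_new):
    """Predict the dependent variable from a value of the independent variable."""
    return intercept + slope * x_new

print(slope, intercept, predict(6.0))
```

With more than one independent variable the same idea applies, but the slope becomes a vector of coefficients, one per variable.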

This week, your task is to perform multiple linear regression on batsmen salaries. You'll use the average runs scored per game and the strike rate as independent variables, and the salary as the dependent variable to predict. You'll also group the data by year and fit a separate model for each year.

The dataset, Data_Mendeley.csv, is provided on GitHub. Feel free to create any new functions required.

In [3]:
# Import the required libraries
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt
import pandas as pd

Prepare Data

In [4]:
#mounting gdrive
from google.colab import drive
drive.mount('/content/drive')
file_path = '/content/drive/My Drive/wids/Data_Mendeley.csv'
data=pd.read_csv(file_path)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Implement Linear regression here :)

In [6]:
# Re-read the CSV as a NumPy structured array so columns can be accessed by name
data = np.genfromtxt(file_path, delimiter=',', dtype=None, names=True, encoding='utf-8')
years = data['Year']
average_runs = data['Ave']
strike_rate = data['StrRate']
salary = data['Final_Price']

# Replace missing values in each column with that column's mean
average_runs = np.nan_to_num(average_runs, nan=np.nanmean(average_runs))
strike_rate = np.nan_to_num(strike_rate, nan=np.nanmean(strike_rate))
salary = np.nan_to_num(salary, nan=np.nanmean(salary))

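# Function to compute regression coefficients (least-squares solution of the normal equation)
def compute_coefficients(X, y):
    X_with_intercept = np.hstack((np.ones((X.shape[0], 1)), X))  # Add intercept column
    # lstsq gives the same coefficients as inverting X^T X when that matrix is
    # well conditioned, but does not fail when it is singular or nearly so
    beta, *_ = np.linalg.lstsq(X_with_intercept, y, rcond=None)
    return beta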

# Train and evaluate models for each year
unique_years = np.unique(years)
models = {}
performance = []

for year in unique_years:
    # Filter data for the current year
    mask = years == year
    X_year = np.column_stack((average_runs[mask], strike_rate[mask]))
    y_year = salary[mask]

    # Compute regression coefficients
    beta = compute_coefficients(X_year, y_year)
    models[year] = beta

    # Predict salaries
    X_with_intercept = np.hstack((np.ones((X_year.shape[0], 1)), X_year))
    y_pred = X_with_intercept @ beta

    # Compute metrics
    mse = np.mean((y_year - y_pred) ** 2)
    ss_total = np.sum((y_year - np.mean(y_year)) ** 2)
    ss_residual = np.sum((y_year - y_pred) ** 2)
    r2 = 1 - (ss_residual / ss_total)

    # Store performance metrics
    performance.append({'Year': year, 'MSE': mse, 'R2': r2})

# Display performance metrics
for result in performance:
    print(f"Year {result['Year']}: MSE = {result['MSE']:.2f}, R² = {result['R2']:.2f}")


Year 2008: MSE = 194441927429757.91, R² = 0.09
Year 2009: MSE = 277463934027205.34, R² = 0.07
Year 2010: MSE = 207618680931127.41, R² = 0.20
Year 2011: MSE = 633059295757836.00, R² = 0.17
Year 2012: MSE = 734227247538172.50, R² = 0.08
Year 2013: MSE = 621778479541908.50, R² = 0.18
Year 2014: MSE = 919218445895563.38, R² = 0.25
Year 2015: MSE = 966581971290133.25, R² = 0.16
Year 2016: MSE = 721748414378591.88, R² = 0.22
Year 2017: MSE = 948609867206115.38, R² = 0.13
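
The least-squares coefficients computed per year above satisfy the normal equation, (XᵀX)β = Xᵀy, where X includes a leading column of ones for the intercept. The sketch below checks this on synthetic data, since the real dataset is not bundled here. The data, `true_beta`, and the variable names are assumptions made for illustration. It solves the normal equation directly and compares the result with NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: two features, known coefficients, small noise.
n = 200
X = rng.normal(size=(n, 2))
true_beta = np.array([5.0, 2.0, -3.0])  # intercept, then one weight per feature
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.1, size=n)

# Design matrix with a leading column of ones for the intercept.
A = np.hstack((np.ones((n, 1)), X))

# Normal equation: solve (A^T A) beta = A^T y without forming an explicit inverse.
beta_normal = np.linalg.solve(A.T @ A, A.T @ y)

# General least-squares solver, which also copes with rank-deficient matrices.
beta_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

print(beta_normal)
print(beta_lstsq)
```

Both approaches should agree closely when XᵀX is well conditioned. The solver-based route is usually the safer default because it avoids an explicit matrix inverse.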


# Logistic Regression

Logistic regression is a process of modeling the probability of a discrete outcome given an input variable. The most common logistic regression models a binary outcome; something that can take two values such as true/false, yes/no, and so on.
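
To make "modeling the probability" concrete, here is a small sketch of how a binary logistic model turns a feature vector into a probability and then into a label. The weights `w`, the bias `b`, and the 0.5 threshold are made-up values chosen for illustration, not fitted parameters.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias for a two-feature model (illustrative values only).
w = np.array([1.5, -0.5])
b = 0.25

def predict_proba(x):
    """Probability of the positive class for one feature vector x."""
    return sigmoid(x @ w + b)

def predict_label(x, threshold=0.5):
    """Binary decision obtained by thresholding the probability."""
    return int(predict_proba(x) >= threshold)

x = np.array([2.0, 1.0])
print(predict_proba(x), predict_label(x))
```

Training a logistic regression model amounts to choosing `w` and `b` so that these probabilities match the observed labels as well as possible.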

This week you will perform logistic regression on the breast cancer dataset using the sklearn library. Feel free to create any new functions required.

In [None]:
# importing libraries
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

Prepare Data

In [None]:
breast_cancer = datasets.load_breast_cancer()
X, y = breast_cancer.data, breast_cancer.target

In [None]:
# splitting data for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

Implement Logistic Regression here :)
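
One possible way to fill in this step is sketched below, assuming scikit-learn's `LogisticRegression` with default settings is acceptable for the assignment. The block reloads and splits the data exactly as in the cells above so that it runs on its own. The `max_iter=1000` setting and the choice of accuracy as the metric are assumptions, not requirements from the task.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Reload and split the data as in the preparation cells.
breast_cancer = datasets.load_breast_cancer()
X, y = breast_cancer.data, breast_cancer.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1234
)

# Standardise using statistics from the training split only.
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fit with default settings; max_iter is raised as a precaution against
# convergence warnings (an assumption, not a tuned value).
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate on the held-out test split.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.3f}")
```

Accuracy is only one choice of metric. For a medical dataset it may also be worth inspecting `sklearn.metrics.confusion_matrix` to see which class the errors fall in.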