<a href="https://colab.research.google.com/github/wendywqz/GenAI/blob/main/Applying_metrics_and_cross_validationipynb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [9]:
%pip install numpy pandas scikit-learn



In [11]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, r2_score, classification_report

Use **StudyHours** and **PrevExamScore** as the *`features`* and **Pass** (0 = Fail, 1 = Pass) as the *`target variable`*.

In [4]:
# Sample dataset: Study hours, previous exam scores, and pass/fail labels
data = {
    'StudyHours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'PrevExamScore': [30, 40, 45, 50, 60, 65, 70, 75, 80, 85],
    'Pass': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]  # 0 = Fail, 1 = Pass
}

df = pd.DataFrame(data)

# Features and target variable
X = df[['StudyHours', 'PrevExamScore']]
y = df['Pass']

df.head()

|   | StudyHours | PrevExamScore | Pass |
|---|------------|---------------|------|
| 0 | 1          | 30            | 0    |
| 1 | 2          | 40            | 0    |
| 2 | 3          | 45            | 0    |
| 3 | 4          | 50            | 0    |
| 4 | 5          | 60            | 0    |


## **Applying evaluation metrics without cross-validation**

In [5]:
from sklearn.linear_model import LogisticRegression

# Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

Next, calculate the model’s accuracy, precision, recall, and F1 score using the test set predictions:

In [6]:
# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print(f"Classification Report:\n{report}")

Accuracy: 1.0
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         1
           1       1.00      1.00      1.00         1

    accuracy                           1.00         2
   macro avg       1.00      1.00      1.00         2
weighted avg       1.00      1.00      1.00         2



Accuracy measures the proportion of correct predictions.

Precision indicates how many predicted positives were correct.

Recall measures how many actual positives were correctly predicted.

F1 score is the harmonic mean of precision and recall, so it is high only when both are high.
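To make these definitions concrete, the sketch below derives all four metrics from confusion-matrix counts. The labels `y_true` and `y_hat` are hypothetical values chosen only to keep the arithmetic easy to follow; they are not part of the dataset above.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, chosen only for illustration (not from the dataset above)
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_hat = [0, 1, 0, 1, 1, 0, 1, 0]

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # correct predictions / all predictions
precision = tp / (tp + fp)                  # correct positives / predicted positives
recall = tp / (tp + fn)                     # correct positives / actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f'TN={tn}, FP={fp}, FN={fn}, TP={tp}')
print(f'Accuracy: {accuracy}, Precision: {precision}, Recall: {recall}, F1: {f1}')
```

With these labels there are three true positives, three true negatives, one false positive, and one false negative, so every metric works out to 0.75.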

In [14]:
# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1-Score: {f1}')

Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1-Score: 1.0


## **Introducing cross-validation**

While the above method works, it relies on a single train-test split. With only two test samples here, the resulting scores depend heavily on which rows happen to land in the test set. To get a more reliable performance estimate, use cross-validation, which evaluates the model on several different splits of the dataset and averages the results.

Cross-validation involves splitting the data into multiple folds, training the model on some folds, and testing it on the remaining folds. The process is repeated for each fold, and the average performance is taken across all folds.
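As a rough illustration of the splitting step, the sketch below uses `KFold` to partition ten row indices into five folds. The array `indices` is a stand-in for the ten rows of the dataset, and no shuffling is applied, which is an assumption made here to keep the folds easy to read.

```python
import numpy as np
from sklearn.model_selection import KFold

# Stand-in for the ten rows of the dataset (assumption for illustration)
indices = np.arange(10)

# Five folds without shuffling: each fold's test set is a contiguous block of rows
kf = KFold(n_splits=5)
folds = list(kf.split(indices))

for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f'Fold {i}: train={train_idx.tolist()} test={test_idx.tolist()}')
```

Every row appears in exactly one test fold, so each sample is used for evaluation once and for training four times.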

In [29]:
# In k-fold cross-validation, the dataset is split into k equal parts (folds).
# Each fold is used as a test set while the remaining folds are used for training:

from sklearn.model_selection import cross_val_score

# Initialize the model
model = LogisticRegression()

# Perform 5-fold cross-validation and calculate accuracy for each fold
cv_scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')

# Display the accuracy for each fold and the mean accuracy
print(f'Cross-validation accuracies: {cv_scores}')
print(f'Mean cross-validation accuracy: {np.mean(cv_scores)}')

Cross-validation accuracies: [1.  1.  1.  1.  0.5]
Mean cross-validation accuracy: 0.9


Here, the `cross_val_score` function automatically splits the data into five folds (stratified by class, because the estimator is a classifier and `cv` is an integer), trains the model on four folds, and tests it on the remaining fold. This process is repeated five times, and the function returns the accuracy for each fold.
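The same procedure can be written as an explicit loop, which may help show what happens inside `cross_val_score`. This is a minimal sketch, not the library's internal implementation: it assumes an unshuffled `StratifiedKFold` with five splits, and the names `manual_scores` and `reference_scores` are introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Same sample dataset as above, repeated so the block runs on its own
df = pd.DataFrame({
    'StudyHours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'PrevExamScore': [30, 40, 45, 50, 60, 65, 70, 75, 80, 85],
    'Pass': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})
X = df[['StudyHours', 'PrevExamScore']]
y = df['Pass']

model = LogisticRegression()
skf = StratifiedKFold(n_splits=5)  # keeps the class ratio in every fold

manual_scores = []
for train_idx, test_idx in skf.split(X, y):
    fold_model = clone(model)  # fresh, unfitted copy for each fold
    fold_model.fit(X.iloc[train_idx], y.iloc[train_idx])
    fold_pred = fold_model.predict(X.iloc[test_idx])
    manual_scores.append(accuracy_score(y.iloc[test_idx], fold_pred))

manual_scores = np.array(manual_scores)
reference_scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')

print(f'Manual loop:     {manual_scores}')
print(f'cross_val_score: {reference_scores}')
```

Cloning the model inside the loop ensures that no fold sees parameters fitted on another fold's training data.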

In [28]:
# Testing: compare cross_validate with cross_val_score
# Initialize the model
model = LogisticRegression()
print(model)

from sklearn.model_selection import cross_validate
# cross_validate returns fit and score times along with the test scores
cv = cross_validate(model, X, y, cv=5)
print(cv)

from sklearn.model_selection import cross_val_score

# Perform 5-fold cross-validation and calculate accuracy for each fold
cv_scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')

# Print the cross-validation scores
print(f'Cross-Validation Scores: {cv_scores}')
print(f'Mean cross-validation accuracy: {np.mean(cv_scores)}')

LogisticRegression()
{'fit_time': array([0.00832343, 0.00806284, 0.0064888 , 0.00650382, 0.00675964]), 'score_time': array([0.00191498, 0.00176835, 0.00169468, 0.00180387, 0.00171876]), 'test_score': array([1. , 1. , 1. , 1. , 0.5])}
Cross-Validation Scores: [1.  1.  1.  1.  0.5]
Mean cross-validation accuracy: 0.9


## **Cross-validation with multiple metrics**

Calculate multiple metrics during cross-validation using the scoring parameter. Use k-fold cross-validation to calculate accuracy, precision, recall, and F1 score:



In [30]:
from sklearn.model_selection import cross_validate

# Define multiple scoring metrics
scoring = ['accuracy', 'precision', 'recall', 'f1']

# Perform cross-validation
cv_results = cross_validate(model, X, y, cv=5, scoring=scoring)

# Print results for each metric
print(f"Cross-validation Accuracy: {np.mean(cv_results['test_accuracy'])}")
print(f"Cross-validation Precision: {np.mean(cv_results['test_precision'])}")
print(f"Cross-validation Recall: {np.mean(cv_results['test_recall'])}")
print(f"Cross-validation F1-Score: {np.mean(cv_results['test_f1'])}")

Cross-validation Accuracy: 0.9
Cross-validation Precision: 0.9
Cross-validation Recall: 1.0
Cross-validation F1-Score: 0.9333333333333333
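
Averages alone can hide how much the metrics vary from fold to fold. One possible way to inspect this is to collect the per-fold values from `cross_validate` into a DataFrame; the names `per_fold` and `summary` below are introduced here for illustration.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Same sample dataset as above, repeated so the block runs on its own
df = pd.DataFrame({
    'StudyHours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'PrevExamScore': [30, 40, 45, 50, 60, 65, 70, 75, 80, 85],
    'Pass': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})
X = df[['StudyHours', 'PrevExamScore']]
y = df['Pass']

scoring = ['accuracy', 'precision', 'recall', 'f1']
cv_results = cross_validate(LogisticRegression(), X, y, cv=5, scoring=scoring)

# cross_validate stores each metric under the key 'test_<metric name>'
per_fold = pd.DataFrame({name: cv_results[f'test_{name}'] for name in scoring})
per_fold.index.name = 'fold'

# Mean and standard deviation across the five folds
summary = per_fold.agg(['mean', 'std'])

print(per_fold)
print(summary)
```

With only two test samples per fold, each per-fold value can take very few distinct levels, so a large standard deviation is to be expected on a dataset this small.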


## **Cross-validation with a regression model**

For regression tasks, use metrics such as mean absolute error (MAE), mean squared error (MSE), and R-squared. Apply these metrics with cross-validation for a regression model:

**R-squared** indicates how well the model explains the variance in the target variable.

**MSE and MAE** measure the average squared error and the average absolute error, respectively, between the predicted and actual values; lower values are better.

In [31]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Sample dataset for regression
X_reg = df[['StudyHours']]
y_reg = df['PrevExamScore']

# Initialize a linear regression model
reg_model = LinearRegression()

# Perform 5-fold cross-validation using R-squared as the metric
cv_scores_r2 = cross_val_score(reg_model, X_reg, y_reg, cv=5, scoring='r2')

print(f'Cross-validation R-squared scores: {cv_scores_r2}')
print(f'Mean R-squared score: {np.mean(cv_scores_r2)}')

Cross-validation R-squared scores: [ 0.52933673  0.88503086 -0.60298929  0.88503086 -1.28939909]
Mean R-squared score: 0.08140201560607148
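
With ten samples and `cv=5`, each test fold holds only two points, so R-squared is computed from very little data and can swing widely, which likely contributes to the negative scores above. MAE and MSE can be cross-validated in the same way. The sketch below assumes scikit-learn's `neg_mean_absolute_error` and `neg_mean_squared_error` scorers, which return negated errors because scorers follow a "greater is better" convention; the dictionary keys `mae` and `mse` are names chosen here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Same regression data as above, repeated so the block runs on its own
df = pd.DataFrame({
    'StudyHours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'PrevExamScore': [30, 40, 45, 50, 60, 65, 70, 75, 80, 85],
})
X_reg = df[['StudyHours']]
y_reg = df['PrevExamScore']

# Map short names (chosen here) to scikit-learn's negated error scorers
scoring = {'mae': 'neg_mean_absolute_error', 'mse': 'neg_mean_squared_error'}
cv_reg = cross_validate(LinearRegression(), X_reg, y_reg, cv=5, scoring=scoring)

# Flip the sign to recover ordinary (non-negative) error values
mae_per_fold = -cv_reg['test_mae']
mse_per_fold = -cv_reg['test_mse']

print(f'MAE per fold: {mae_per_fold}')
print(f'Mean MAE: {np.mean(mae_per_fold)}')
print(f'MSE per fold: {mse_per_fold}')
print(f'Mean MSE: {np.mean(mse_per_fold)}')
```

Because MAE is expressed in the same units as the target (exam score points), it is often easier to interpret on small folds than R-squared.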
