<a href="https://colab.research.google.com/github/robdnh/ml_course/blob/main/fault_detect_logistic_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Import required libraries and download data

In [10]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from scipy.stats import pearsonr
from git import Repo

Repo.clone_from('https://github.com/robdnh/data.git', '/content/data')

<git.repo.base.Repo '/content/data/.git'>

### 1. Identify a problem we'd like to solve

<em>Given current and voltage, can we train a logistic regression model that can identify whether or not there is a fault in the line?</em>

### 2. Load the data into a pandas DataFrame for easy inspection, manipulation, and feature engineering

In [14]:
# Load dataset from the specified CSV file
# Data source: https://www.kaggle.com/datasets/esathyaprakash/electrical-fault-detection-and-classification?resource=download&select=detect_dataset.csv
import os
os.getcwd()
df = pd.read_csv('data/logistic-regression/fault-detect-dataset.csv')
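
Before modelling, it usually helps to look at what was loaded. The sketch below shows a few common inspection calls on a small, made-up stand-in frame (`df_demo` and its values are hypothetical; only the column names are taken from this notebook). The same calls could be run on `df`.

```python
import pandas as pd

# Hypothetical stand-in for the real CSV: column names match the notebook,
# values are invented, and one all-empty column mimics 'Unnamed: 7'.
df_demo = pd.DataFrame({
    "Output (S)": [0, 0, 1, 1],
    "Ia": [-170.5, -122.2, 310.4, 295.1],
    "Ib": [9.2, 6.2, -150.3, -140.8],
    "Ic": [161.3, 116.1, -160.1, -154.3],
    "Va": [0.05, 0.10, 0.01, 0.02],
    "Vb": [-0.66, -0.63, -0.02, -0.03],
    "Vc": [0.61, 0.53, 0.01, 0.01],
    "Unnamed: 7": [None, None, None, None],
})

print(df_demo.head())    # first rows of the frame
print(df_demo.dtypes)    # data type of each column

# Count missing values per column; an all-empty column stands out here.
missing_per_column = df_demo.isna().sum()
print(missing_per_column)

# Count how many rows fall into each label value.
label_counts = df_demo["Output (S)"].value_counts()
print(label_counts)
```

The missing-value counts show which columns carry no data, and the label counts show how balanced the two classes are.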



**Exercise**: <em>Looking at this dataset, what are our features and what is our label? Is our label binary or multi-class?</em>

**Exercise**: <em>The features used for logistic regression should be relatively independent of one another. How do we check whether variables are independent?</em>

In [15]:
df.corr().round(3)

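| | Output (S) | Ia | Ib | Ic | Va | Vb | Vc | Unnamed: 7 | Unnamed: 8 |
|---|---|---|---|---|---|---|---|---|---|
| Output (S) | 1.0 | 0.039 | -0.134 | 0.12 | -0.035 | 0.012 | 0.023 | NaN | NaN |
| Ia | 0.039 | 1.0 | -0.375 | -0.276 | 0.033 | -0.158 | 0.13 | NaN | NaN |
| Ib | -0.134 | -0.375 | 1.0 | -0.53 | -0.027 | 0.032 | -0.006 | NaN | NaN |
| Ic | 0.12 | -0.276 | -0.53 | 1.0 | -0.002 | -0.096 | 0.1 | NaN | NaN |
| Va | -0.035 | 0.033 | -0.027 | -0.002 | 1.0 | -0.508 | -0.471 | NaN | NaN |
| Vb | 0.012 | -0.158 | 0.032 | -0.096 | -0.508 | 1.0 | -0.52 | NaN | NaN |
| Vc | 0.023 | 0.13 | -0.006 | 0.1 | -0.471 | -0.52 | 1.0 | NaN | NaN |
| Unnamed: 7 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Unnamed: 8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |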


A positive correlation between two variables means they tend to move in the same direction, while a negative correlation means they tend to move in opposite directions.
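
The notebook imports `pearsonr` but relies on `df.corr()` for the table. As a rough illustration of positive versus negative correlation, the sketch below applies `pearsonr` to synthetic data (the arrays `x`, `same_direction`, and `opposite_direction` are invented for this example and are not taken from the dataset).

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic data (assumption): one base signal plus two noisy derived signals.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
noise = rng.normal(scale=0.5, size=500)

same_direction = 2.0 * x + noise       # rises when x rises
opposite_direction = -2.0 * x + noise  # falls when x rises

# pearsonr returns the correlation coefficient and a p-value.
r_pos, p_pos = pearsonr(x, same_direction)
r_neg, p_neg = pearsonr(x, opposite_direction)

print(f"same direction:     r = {r_pos:.3f}")
print(f"opposite direction: r = {r_neg:.3f}")
```

Unlike `df.corr()`, `pearsonr` works on one pair of variables at a time and also reports a p-value, which can help judge whether a weak correlation is distinguishable from noise.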

**Exercise**: <em>Are there fields with positive and negative correlations? Do the fields seem sufficiently independent? Is this an optimal dataset for logistic regression?
**Bonus**: Given what this dataset measures, can you explain why the fields are correlated this way?
</em>


### 3. Clean the dataset and normalize values, if necessary

In [33]:
# Remove the two empty placeholder columns (all NaN in the correlation table above)
df.drop(columns=['Unnamed: 7', 'Unnamed: 8'], inplace=True)

# Fill missing values using forward fill (propagates last valid observation forward)
df.ffill(inplace=True)
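
The cell above cleans the data but leaves the features on their original scales. In the printed data the currents are in the hundreds while the voltages stay below 1. If rescaling turned out to be necessary, one common option is standardization with scikit-learn's `StandardScaler`. The sketch below is an assumption about how that could look, not a step this notebook performs; `features_demo` is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic features (assumption) on very different scales,
# loosely mimicking current (hundreds) and voltage (below 1).
rng = np.random.default_rng(0)
features_demo = pd.DataFrame({
    "Ia": rng.normal(loc=0.0, scale=100.0, size=200),
    "Va": rng.normal(loc=0.0, scale=0.3, size=200),
})

# Standardize each column to zero mean and unit variance.
scaler = StandardScaler()
scaled = scaler.fit_transform(features_demo)

scaled_demo = pd.DataFrame(scaled, columns=features_demo.columns)
print(scaled_demo.describe().round(3))
```

In a real pipeline the scaler would normally be fitted on the training split only and then applied to the test split, so that no information from the test data leaks into training.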

### 4. Split the dataset into features (X) and a label (y)

In [34]:
X = df.drop('Output (S)', axis=1)
y = df['Output (S)']


print(X)

#~~~~~~~~~~~~~~~~~

print(y)

               Ia         Ib          Ic        Va        Vb        Vc
0     -170.472196   9.219613  161.252583  0.054490 -0.659921  0.605431
1     -122.235754   6.168667  116.067087  0.102000 -0.628612  0.526202
2      -90.161474   3.813632   86.347841  0.141026 -0.605277  0.464251
3      -79.904916   2.398803   77.506112  0.156272 -0.602235  0.445963
4      -63.885255   0.590667   63.294587  0.180451 -0.591501  0.411050
...           ...        ...         ...       ...       ...       ...
11996  -66.237921  38.457041   24.912239  0.094421 -0.552019  0.457598
11997  -65.849493  37.465454   25.515675  0.103778 -0.555186  0.451407
11998  -65.446698  36.472055   26.106554  0.113107 -0.558211  0.445104
11999  -65.029633  35.477088   26.684731  0.122404 -0.561094  0.438690
12000  -64.598401  34.480799   27.250065  0.131669 -0.563835  0.432166

[12001 rows x 6 columns]
0        0
1        0
2        0
3        0
4        0
        ..
11996    0
11997    0
11998    0
11999    0
12000    0
N

### 5. Split the dataset into training (x_train, y_train) and test (x_test, y_test) sets

In [35]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=39)
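
To see what the split produces, the sketch below runs `train_test_split` on a small synthetic dataset and checks the resulting shapes. It also passes `stratify`, an optional argument the notebook does not use, which keeps the class proportions the same in both splits. `X_demo` and `y_demo` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): 100 samples, 2 features, 70/30 class balance.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 70 + [1] * 30)

# Same test_size and random_state as the notebook, plus optional stratification.
x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=39, stratify=y_demo
)

print(x_tr.shape, x_te.shape)    # 80 training rows, 20 test rows
print(y_tr.mean(), y_te.mean())  # share of class 1 in each split
```

Fixing `random_state` makes the split reproducible, so the accuracy reported later does not change between runs.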

### 6. Establish and train a model

In [36]:
# Initialize logistic regression model with lbfgs solver
model = LogisticRegression(solver='lbfgs')

# Train the model on the training data
model.fit(x_train, y_train)
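
A fitted logistic regression model computes a weighted sum of the features plus an intercept, then passes it through the sigmoid function to get a probability. The sketch below reproduces that calculation by hand on synthetic data and compares it with `predict_proba`. The dataset, the name `clf`, and `max_iter=1000` are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (assumption), 6 features like the notebook.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)

clf = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X_demo, y_demo)

# Linear score: z = w . x + b
z = X_demo @ clf.coef_.ravel() + clf.intercept_[0]

# Sigmoid turns the score into the probability of class 1.
manual_proba = 1.0 / (1.0 + np.exp(-z))
sklearn_proba = clf.predict_proba(X_demo)[:, 1]

# Predicting class 1 when the probability exceeds 0.5 matches clf.predict.
manual_pred = (manual_proba > 0.5).astype(int)

print(np.allclose(manual_proba, sklearn_proba))
print(np.array_equal(manual_pred, clf.predict(X_demo)))
```

The learned weights in `coef_` show how strongly each feature pushes the prediction toward one class or the other, which is only meaningful to compare across features when they share a similar scale.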

### 7. Predict labels (y_pred) associated with the test features (x_test)

In [38]:
# Make predictions on the test set
y_pred = model.predict(x_test)
print(y_pred)

[0 1 1 ... 0 0 0]


### 8. Measure the accuracy of our predictions to evaluate performance

In [40]:
# accuracy_score reports the fraction of test labels that were predicted correctly
accuracy = accuracy_score(y_test, y_pred)
print(f"{accuracy * 100:.2f} %")

73.72 %
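
Accuracy alone can hide how a classifier fails. A confusion matrix, precision, recall, and a majority-class baseline give a fuller picture. The sketch below uses small hand-written label arrays (`y_true_demo` and `y_pred_demo` are invented, not the notebook's results) to show how these metrics could be computed.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hand-written labels (assumption), 1 = fault, 0 = no fault.
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

acc = accuracy_score(y_true_demo, y_pred_demo)

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm.ravel()

precision = precision_score(y_true_demo, y_pred_demo)  # TP / (TP + FP)
recall = recall_score(y_true_demo, y_pred_demo)        # TP / (TP + FN)

# Accuracy obtained by always predicting the most common class.
positive_share = y_true_demo.mean()
majority_baseline = max(positive_share, 1 - positive_share)

print(cm)
print(f"accuracy={acc:.2f} precision={precision:.2f} recall={recall:.2f}")
print(f"majority-class baseline={majority_baseline:.2f}")
```

For fault detection, recall is often the metric to watch, since a missed fault (false negative) is typically costlier than a false alarm. Comparing the model's accuracy against the majority-class baseline shows how much the model actually learned.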
