# Electric Motor Temperature Prediction

## Project Objective
Predict the temperature of electric motor components, specifically the permanent magnet (rotor) temperature `pm`, from sensor data to prevent overheating and improve efficiency.

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import math
warnings.filterwarnings('ignore')

%matplotlib inline

# Machine Learning libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import pickle

## Data Collection & Loading

In [None]:
# Read the dataset
df = pd.read_csv('pmsm_temperature_data.csv')
df.head()

### Descriptive Analysis
We start by looking at each variable's summary statistics to get a first sense of which ones may relate to the rotor temperature.

Descriptive analysis summarizes the basic features of the data statistically. pandas provides the `describe()` function for this. For categorical features it reports the count, the number of unique values, the most frequent value (top), and its frequency. For continuous features it reports the count, mean, standard deviation, minimum, maximum, and percentile values.

**df.info():**
This function displays a brief summary of the dataset: the number of rows and columns, the data type of each column, and the number of non-null values in each column.

**df.describe():**
This function is used to analyze the descriptive statistics of the data such as mean, median, quartile values, maximum and minimum values of each column.

In [None]:
df.info()

In [None]:
df.describe()

### Handling Missing Values
To check for null values we use the df.isnull() function, and chain .sum() to count the nulls in each column.

In [None]:
# Check for null values
df.isnull().sum()

**Conclusion:**
There are no null values in the dataset, so we can proceed without imputation.

## Exploratory Data Analysis (EDA)

### Uni-variate Analysis
Here we look at one variable at a time to understand how its values are distributed.

**Bar Graph:**
A bar chart or bar graph is a chart or graph that presents categorical data with rectangular bars with heights or lengths proportional to the values that they represent. The bars can be plotted vertically or horizontally.

In [None]:
plt.figure(figsize=(15,6))
try:
    df['profile_id'].value_counts().sort_values().plot(kind = 'bar')
    plt.title('Profile ID Distribution')
except KeyError:
    print("Column 'profile_id' not found in dataset. Skipping plot.")
plt.show()

### Box Plot & Distribution Plot

**Box Plot:**
A boxplot is a standardized way of displaying the distribution of data based on a five-number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”). It can tell you about your outliers and what their values are.

**Distribution Plot:**
The distribution plot is suitable for comparing range and distribution for groups of numerical data. Data is plotted as value points along an axis.
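The five-number summary behind a boxplot can be computed directly. The sketch below uses hypothetical sample values (not taken from the motor dataset) and NumPy's default linear interpolation for percentiles.

```python
import numpy as np

# Hypothetical sample values, not taken from the motor dataset
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])

# Q1, median, and Q3 are the 25th, 50th, and 75th percentiles
q1, median, q3 = np.percentile(values, [25, 50, 75])

five_number_summary = {
    'min': values.min(),
    'q1': q1,
    'median': median,
    'q3': q3,
    'max': values.max(),
}
print(five_number_summary)
```

The box spans Q1 to Q3, and the line inside it marks the median.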

In [None]:
df.columns

In [None]:
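# Plot the distribution and a boxplot for every numeric feature to check for skewness
for col in df.select_dtypes(include='number').columns:
    fig, (ax_hist, ax_box) = plt.subplots(2, 1, figsize=(8, 5), sharex=True)
    sns.histplot(df[col], kde=True, color='g', ax=ax_hist)  # histplot replaces the deprecated distplot
    ax_hist.axvline(df[col].mean(), color='r')  # draw the mean line
    sns.boxplot(x=df[col], color='y', ax=ax_box)
    plt.show()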

**Conclusion:**
As we can see from the above plots, the mean and median for most of the plots are very close to each other. So the data seems to have low skewness for almost all variables.
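The visual check can be backed by numbers. As a minimal sketch on hypothetical toy columns (not the motor dataset), comparing the mean, the median, and the sample skewness shows how a skewed column differs from a symmetric one.

```python
import pandas as pd

# Hypothetical columns for illustration; not taken from the motor dataset
toy = pd.DataFrame({
    'symmetric': [1.0, 2.0, 3.0, 4.0, 5.0],
    'right_skewed': [1.0, 1.0, 2.0, 2.0, 10.0],
})

# One row per column: mean, median, and sample skewness
summary = pd.DataFrame({
    'mean': toy.mean(),
    'median': toy.median(),
    'skew': toy.skew(),
})
print(summary)
```

A skewness near zero with mean and median close together points to a roughly symmetric column; a mean well above the median with positive skewness points to a right-skewed one.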

In [None]:
# Summary Statistics
df.describe()

### Handling Outliers
Outliers can be visualized with the boxplots from the univariate analysis above. Upper and lower bounds for a feature follow from the interquartile range (IQR) rule.

The upper bound is Q3 + 1.5 × IQR and the lower bound is Q1 − 1.5 × IQR, where IQR = Q3 − Q1.
Removing outliers discards rows, which can hurt model performance.
When removing them is not acceptable, the capping technique can be used instead.
Capping: replacing values above the upper bound with the upper bound, and values below the lower bound with the lower bound.

**Note:** In our dataset all the values are in the same range, so replacing outliers is not necessary.
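Although capping is not needed for this dataset, the IQR rule described above can be sketched as follows. The helper name `cap_outliers_iqr` and the sample values are hypothetical and only serve as an illustration.

```python
import pandas as pd

def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    # clip replaces values beyond a bound with the bound itself
    return series.clip(lower=lower_bound, upper=upper_bound)

# Hypothetical values with one obvious outlier
sample = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 100.0])
capped = cap_outliers_iqr(sample)
print(capped.tolist())
```

Capping keeps every row, so the training set does not shrink, while extreme values no longer dominate the feature's range.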

### Handling Categorical Values
If a dataset contains categorical data, it must be converted to a numerical form such as integer encoding or binary encoding.

Several encoding techniques exist for converting categorical features into numerical ones; two common choices are feature mapping and label encoding.

**Note:** In our dataset, there is no categorical data type, so we can skip this step.
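For reference, the two techniques could look like the sketch below. The columns `cooling_mode` and `motor_type` are hypothetical, since the motor dataset has no categorical columns, and label encoding is done here with pandas category codes as one possible approach.

```python
import pandas as pd

# Hypothetical categorical columns; the motor dataset has none
toy = pd.DataFrame({
    'cooling_mode': ['low', 'high', 'medium', 'low'],
    'motor_type': ['B', 'A', 'C', 'A'],
})

# Feature mapping: an explicit, ordered mapping chosen by hand
mode_order = {'low': 0, 'medium': 1, 'high': 2}
toy['cooling_mode_encoded'] = toy['cooling_mode'].map(mode_order)

# Label encoding: one integer per distinct label (categories are sorted alphabetically)
toy['motor_type_encoded'] = toy['motor_type'].astype('category').cat.codes

print(toy)
```

Feature mapping suits ordered categories because the integers preserve the order; label encoding assigns arbitrary integers, which tree-based models tolerate better than linear ones.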

In [None]:
# Correlation Matrix
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

### Multi-variate Analysis
Multivariate analysis (MVA) is a statistical procedure for analyzing data that involves more than one type of measurement or observation. It may also mean solving problems where more than one dependent variable is analyzed simultaneously with other variables.

**Scatterplot:**
A scatter plot is a type of plot or mathematical diagram using Cartesian coordinates to display values for typically two variables for a set of data.

**Heat-map:**
A heat map is a data visualization technique that shows the magnitude of a phenomenon as color in two dimensions.

In [None]:
plt.figure(figsize=(20,5))
# Compare stator components for a specific profile (e.g., profile 20)
# If profile 20 is absent, fall back to the first available profile_id
try:
    target_profile = 20
    if target_profile not in df['profile_id'].unique():
        target_profile = df['profile_id'].unique()[0]

    df[df['profile_id'] == target_profile]['stator_yoke'].plot(label = 'stator yoke')
    df[df['profile_id'] == target_profile]['stator_tooth'].plot(label = 'stator tooth')
    df[df['profile_id'] == target_profile]['stator_winding'].plot(label = 'stator winding')
    plt.legend()
    plt.title(f'Stator Temperatures for Profile {target_profile}')
except KeyError:
    print("profile_id column missing or data issue.")
plt.show()

In [None]:
plt.figure(figsize=(14,7))
sns.heatmap(df.corr(),annot=True)
plt.title('Correlation Heatmap with Annotation')
plt.show()

In [None]:
fig, axes = plt.subplots(2, 4, figsize=(20, 5),sharey=True)
# Scatter plot of each input feature against the target (pm)
sns.scatterplot(x=df['ambient'],y=df['pm'],ax=axes[0][0])
sns.scatterplot(x=df['coolant'],y=df['pm'],ax=axes[0][1])
sns.scatterplot(x=df['motor_speed'],y=df['pm'],ax=axes[0][2])
sns.scatterplot(x=df['i_d'],y=df['pm'],ax=axes[0][3])
sns.scatterplot(x=df['u_q'],y=df['pm'],ax=axes[1][0])
sns.scatterplot(x=df['u_d'],y=df['pm'],ax=axes[1][1])
sns.scatterplot(x=df['i_q'],y=df['pm'],ax=axes[1][2])
# Remove empty subplot if any
axes[1][3].axis('off')

plt.show()

As the stator temperature plot above shows, all three stator components (yoke, tooth, and winding) vary in a similar way over the session.
This suggests that the motor was given little time to cool down between sensor recordings.
As profile_id only identifies the measurement session, we can remove it from further analysis and model building.

In [None]:
if 'profile_id' in df.columns:
    df.drop('profile_id',axis = 1,inplace=True)
    print("Dropped profile_id")
else:
    print("profile_id already dropped or not found")

### Drop unwanted features
The stator temperatures (stator_yoke, stator_tooth, stator_winding) are, like the rotor temperature (pm), quantities we would want to predict rather than inputs, so we drop them from the dataset and keep pm as the regression target. Torque is also omitted because it is not reliably measurable in field applications.

The decision to drop these columns is supported by the plots in the data analysis part above.

In [None]:
drop_cols = ['torque', 'stator_yoke', 'stator_tooth', 'stator_winding']
# Check if they exist before dropping to allow re-running
existing_cols_to_drop = [col for col in drop_cols if col in df.columns]
if existing_cols_to_drop:
    df.drop(existing_cols_to_drop, axis=1, inplace=True)
    print(f"Dropped columns: {existing_cols_to_drop}")
else:
    print("Unwanted columns already dropped or not found")

## Data Preprocessing

In [None]:
target = 'pm'
features = ['ambient', 'coolant', 'u_d', 'u_q', 'motor_speed', 'i_d', 'i_q']

X = df[features]
y = df[target]

### Splitting data into train and test
Now let’s split the dataset into train and test sets using the train_test_split() function from sklearn. We pass X, y, test_size (0.2, so 20% of the rows are held out for testing), and random_state (fixed so the split is reproducible).

In [None]:
# Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Normalizing the values
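We use MinMaxScaler, a class in the preprocessing module of the sklearn library, which rescales each feature to the range [0, 1]. The scaler is fitted on the training set only and then applied to the test set.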

In [None]:
# Scaling
# Normalize the features with MinMaxScaler (imported above); fit on the training set only
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
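To make the transformation concrete, the sketch below reproduces MinMaxScaler by hand on hypothetical single-feature arrays (the names `train`, `test`, and `demo_scaler` are illustrative, not part of the notebook). It assumes the default feature range of [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature train/test arrays
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[15.0], [40.0]])

demo_scaler = MinMaxScaler()
train_scaled = demo_scaler.fit_transform(train)  # learns min and max from train
test_scaled = demo_scaler.transform(test)        # reuses the training min and max

# Same result by hand: (x - train_min) / (train_max - train_min)
train_min, train_max = train.min(axis=0), train.max(axis=0)
manual_test_scaled = (test - train_min) / (train_max - train_min)

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Because the test set is scaled with the training minimum and maximum, a test value outside the training range (40.0 here) maps outside [0, 1].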

## Model Building & Evaluation
We will evaluate three models: Linear Regression, Decision Tree, and Random Forest to see which performs best.

In [None]:
# Linear Regression
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)
y_pred_lr = lr.predict(X_test_scaled)
mse_lr = mean_squared_error(y_test, y_pred_lr)
rmse_lr = math.sqrt(mse_lr)
r2_lr = r2_score(y_test, y_pred_lr)

print("Linear Regression Results:")
print(f"RMSE: {rmse_lr}")
print(f"R2 Score: {r2_lr}")

In [None]:
# Decision Tree
dt = DecisionTreeRegressor(random_state=42)
dt.fit(X_train_scaled, y_train)
y_pred_dt = dt.predict(X_test_scaled)
mse_dt = mean_squared_error(y_test, y_pred_dt)
rmse_dt = math.sqrt(mse_dt)
r2_dt = r2_score(y_test, y_pred_dt)

print("Decision Tree Results:")
print(f"RMSE: {rmse_dt}")
print(f"R2 Score: {r2_dt}")

In [None]:
# Random Forest
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train_scaled, y_train)
y_pred_rf = rf.predict(X_test_scaled)
mse_rf = mean_squared_error(y_test, y_pred_rf)
rmse_rf = math.sqrt(mse_rf)
r2_rf = r2_score(y_test, y_pred_rf)

print("Random Forest Results:")
print(f"RMSE: {rmse_rf}")
print(f"R2 Score: {r2_rf}")

### Model Comparison
Of the three models, the Decision Tree regressor gives an R2 score of 96%, which means it explains about 96% of the variance in the rotor temperature on the test set. We therefore select the decision tree model and save it.
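As a reminder of what the two reported metrics measure, here is a minimal sketch that computes R2 and RMSE by hand. The arrays `y_true` and `y_hat` are hypothetical and are not results from the models above.

```python
import numpy as np

# Hypothetical true and predicted temperatures
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.0, 2.0, 3.0, 5.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1.0 - ss_res / ss_tot                      # share of variance explained
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))  # error in the target's units

print(r2, rmse)
```

R2 is unitless and compares the model with always predicting the mean, while RMSE is expressed in the same units as pm, so the two are best read together.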

## Save The Model
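We persist the selected model and the fitted scaler with pickle so that both can be reloaded later for prediction.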

In [None]:
# Saving the best model (Decision Tree)
final_model = dt

with open('model.save', 'wb') as f:
    pickle.dump(final_model, f)

with open('transform.save', 'wb') as f:
    pickle.dump(scaler, f)

print("Model (Decision Tree) and Scaler saved.")
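Reloading works the same way in reverse. The sketch below is self-contained, so it trains a tiny stand-in model and scaler on hypothetical data and writes them to a temporary directory; only the file names `model.save` and `transform.save` come from the notebook.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor

# Hypothetical stand-ins for the notebook's fitted scaler and model
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0.0, 10.0, 20.0, 30.0])
demo_scaler = MinMaxScaler().fit(X_demo)
demo_model = DecisionTreeRegressor(random_state=42).fit(
    demo_scaler.transform(X_demo), y_demo
)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'model.save')
    scaler_path = os.path.join(tmp_dir, 'transform.save')
    with open(model_path, 'wb') as f:
        pickle.dump(demo_model, f)
    with open(scaler_path, 'wb') as f:
        pickle.dump(demo_scaler, f)

    # Reload both artifacts, as a separate inference script would
    with open(model_path, 'rb') as f:
        loaded_model = pickle.load(f)
    with open(scaler_path, 'rb') as f:
        loaded_scaler = pickle.load(f)

# Scale the new reading with the loaded scaler before predicting
new_reading = np.array([[2.0]])
prediction = loaded_model.predict(loaded_scaler.transform(new_reading))
print(prediction)
```

With the real artifacts, new readings would need the same seven features in the same order used for training, and pickle files should only be loaded from trusted sources.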