### Introduction

The May 2022 edition of the Tabular Playground Series is a binary classification problem that includes a number of different feature interactions. This competition is an opportunity to explore various methods for identifying and exploiting these feature interactions.

In this dataset, we are given (simulated) manufacturing control data and asked to predict whether the machine is in state 0 or state 1. The data has various feature interactions that may be important in determining the machine state.

### Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


### Load the data

In [None]:
train_df = pd.read_csv('/kaggle/input/tabular-playground-series-may-2022/train.csv')
test_df = pd.read_csv('/kaggle/input/tabular-playground-series-may-2022/test.csv')
submission_df = pd.read_csv('/kaggle/input/tabular-playground-series-may-2022/sample_submission.csv')

In [None]:
train_df.head()

In [None]:
test_df.head()

In [None]:
print("The shape of the train dataset: ", train_df.shape)
print("The shape of the test dataset: ", test_df.shape)

In [None]:
train_df.describe()

In [None]:
print("Null values in the train dataset: ", train_df.isnull().sum())
print("Null values in the test dataset: ", test_df.isnull().sum())


In [None]:
print("Datatypes in the train dataset: ", train_df.dtypes)
print("Datatypes in the test dataset: ", test_df.dtypes)

From this, we can infer that f_27 is an object column.

In [None]:
#shows basic statistics of each categoric column of the dataframe (f_27)
train_df.describe(include=['object'])

f_27 is the only categorical column, and it contains many unique strings of letters. It may not be useful as is.
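
Before setting f_27 aside, it may be worth checking whether its individual characters carry signal. The sketch below assumes every f_27 value is a string of uppercase letters with the same length (an assumption to verify against the data). `split_f27` and the `ch` column prefix are hypothetical names used for illustration, and the two demo strings are made up.

```python
import pandas as pd


def split_f27(df: pd.DataFrame, col: str = "f_27", prefix: str = "ch") -> pd.DataFrame:
    """Return per-character ordinal features plus a unique-character count.

    Assumes all strings in `col` have the same length and use 'A'..'Z' only.
    """
    out = pd.DataFrame(index=df.index)
    length = int(df[col].str.len().max())
    for i in range(length):
        # Map 'A' -> 0, 'B' -> 1, ... for the character at position i.
        out[f"{prefix}{i}"] = df[col].str.get(i).apply(ord) - ord("A")
    # Number of distinct letters in each string.
    out[f"{prefix}_unique"] = df[col].apply(lambda s: len(set(s)))
    return out


# Made-up strings, not taken from the competition data.
demo = pd.DataFrame({"f_27": ["ABABAB", "CCCCCC"]})
demo_features = split_f27(demo)
print(demo_features)
```

The resulting columns are numeric, so they could be appended to the feature matrix alongside, or instead of, a single encoded f_27 column.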

The target variable is binary: each row is labelled either 0 or 1.

In [None]:
#target count

train_df['target'].value_counts()


### EDA

In [None]:
#Collect the names of the float and integer columns (no dtype conversion happens here)
#train data
float_list_train = train_df.select_dtypes(include=[np.float64]).columns
int_list_train = train_df.select_dtypes(include=[np.int64]).columns
#test data
float_list_test = test_df.select_dtypes(include=[np.float64]).columns
int_list_test = test_df.select_dtypes(include=[np.int64]).columns
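
The cell above only collects column names. If saving memory is the goal, one option is to downcast the numeric columns. The following is a minimal sketch, not part of the original notebook: `downcast_numeric` is a hypothetical helper, and note that downcasting floats to `float32` loses some precision.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with float64/int64 columns downcast to smaller dtypes."""
    out = df.copy()
    for col in out.select_dtypes(include=[np.float64]).columns:
        # float64 -> float32 (the smallest float dtype to_numeric will pick)
        out[col] = pd.to_numeric(out[col], downcast="float")
    for col in out.select_dtypes(include=[np.int64]).columns:
        # int64 -> smallest signed integer dtype that holds every value
        out[col] = pd.to_numeric(out[col], downcast="integer")
    return out


# Tiny made-up frame to show the effect on dtypes.
demo = pd.DataFrame({"a": [0.5, 1.5], "b": [1, 2]})
small = downcast_numeric(demo)
print(small.dtypes)
```

On the real data this would be called as `train_df = downcast_numeric(train_df)`, and the same for `test_df`.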


In [None]:
fig, axs = plt.subplots(4,4 ,figsize=(16,16))
for f, ax in zip(float_list_train,axs.ravel()):
  ax.hist(train_df[f], density=True, bins=100)
  ax.set_title(f'Train {f}, std={train_df[f].std():.1f}')
plt.suptitle('Histograms of the float features',y=0.93,fontsize=20)
plt.show()

In [None]:
plt.rcParams["figure.figsize"] = (25,20)
#The below code will display value counts of the categoric feature 'f_27'
train_df['f_27'].value_counts()[:50].plot(kind='bar') #shows the top 50 common values
plt.title('f_27 Top 50 Most Common Values', {'size': '35'}) #Adds title
plt.show() #displays figure

In [None]:
import seaborn as sns
# Correlation matrix of float features
plt.figure(figsize=(16,16))
sns.heatmap(train_df[float_list_train].corr(),center=0,annot=True,fmt='.3f')

In [None]:
#Correlation matrix of target variable

plt.figure(figsize=(30, 2))         
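# Keep numeric columns only (f_27 is still a string here) and select the target row by name
sns.heatmap(train_df.select_dtypes(include='number').corr().loc[['target']], 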
            cmap="viridis",         
            annot=True              
           )
plt.title('Correlation with Target Feature', {'size': '35'}) #sets title
plt.show() #displays figure

The heatmap shows that some features have a little predictive power for the target on their own. However, the difference in each feature's distribution between the two classes is not large, so no single feature is very useful by itself. In a bivariate analysis of f_19 and f_21 (scatter plot), we observed different target values in different areas of the feature space, which suggests that combinations of features have greater predictive power than individual features.
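
One way to check for such an interaction numerically is to bin two features and compare the mean target in each cell. The sketch below runs on synthetic data, not the competition data: `target_rate_grid` is a hypothetical helper, and the XOR-like target is constructed purely to illustrate a case where each feature alone is nearly uncorrelated with the target while the pair determines it.

```python
import numpy as np
import pandas as pd


def target_rate_grid(df, x, y, bins, target="target"):
    """Mean of `target` in each cell of a grid over columns x and y.

    `bins` is passed to pd.cut, so it can be a bin count or a list of edges.
    """
    x_bin = pd.cut(df[x], bins=bins, labels=False)
    y_bin = pd.cut(df[y], bins=bins, labels=False)
    return df.groupby([x_bin, y_bin])[target].mean().unstack()


# Synthetic stand-in for two interacting features.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"f_19": rng.normal(size=2000), "f_21": rng.normal(size=2000)})
# Target is 1 when both features share a sign (an XOR-like pattern).
demo["target"] = (demo["f_19"] * demo["f_21"] > 0).astype(int)

# Split each feature at zero: bin 0 is negative, bin 1 is positive.
grid = target_rate_grid(demo, "f_19", "f_21", bins=[-np.inf, 0, np.inf])
print(grid)
print("corr(f_19, target):", round(demo["f_19"].corr(demo["target"]), 3))
```

On the real data, calling the helper with an integer such as `bins=10` would give a coarse picture of where in the f_19/f_21 plane each class dominates.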

### Preprocess the data


In [None]:
df_frequency_map = train_df.f_27.value_counts().to_dict()
train_df.f_27 = train_df.f_27.map(df_frequency_map)
train_df.f_27.head()
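
Frequency encoding replaces each category with the number of times it occurs. A minimal sketch of a reusable version follows, in which the map is learned once from training data and then applied to any other data, with unseen values encoded as 0. `fit_frequency_map` and `apply_frequency_map` are hypothetical helper names, and the demo strings are made up.

```python
import pandas as pd


def fit_frequency_map(train_col: pd.Series) -> dict:
    """Count how often each value occurs in the training column."""
    return train_col.value_counts().to_dict()


def apply_frequency_map(col: pd.Series, freq_map: dict) -> pd.Series:
    """Encode a column with a previously fitted map; unseen values become 0."""
    return col.map(freq_map).fillna(0).astype(int)


train_demo = pd.Series(["AB", "AB", "CD"])
test_demo = pd.Series(["AB", "ZZ"])

freq = fit_frequency_map(train_demo)
encoded = apply_frequency_map(test_demo, freq)
print(encoded.tolist())  # prints [2, 0]
```

Learning the map from the training data alone keeps the encoding of a given string identical wherever it appears.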

In [None]:
#Extracting Independent and dependent Variable 

X= train_df.iloc[:,1:32 ].values  
y= train_df.iloc[:, 32].values  

### Modelling

In [None]:
from sklearn.model_selection import train_test_split  
x_train, x_test, y_train, y_test= train_test_split(X, y, test_size= 0.05, random_state=0)

In [None]:
from sklearn.preprocessing import StandardScaler    
st_x= StandardScaler()    
x_train= st_x.fit_transform(x_train)    
x_test= st_x.transform(x_test)  
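
The scaler is fitted on the training split only and then reused on the hold-out split, and the same fitted scaler has to be applied to any data the model later predicts on. One way to make that hard to forget is to bundle the scaler and the model in a scikit-learn pipeline. The sketch below uses synthetic data from `make_classification` as a stand-in for the competition data, and the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the competition data).
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# fit() fits the scaler on X_tr only, then trains the model on the scaled X_tr.
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=0))
pipe.fit(X_tr, y_tr)

# score() and predict() apply the already-fitted scaler before the model.
holdout_accuracy = pipe.score(X_te, y_te)
print(f"Hold-out accuracy: {holdout_accuracy:.4f}")
```

With this design, `pipe.predict(new_data)` always scales `new_data` with the statistics learned during training.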

#### Logistic Regression

In [None]:
#Fitting Logistic Regression to the training set  
from sklearn.linear_model import LogisticRegression  
classifier= LogisticRegression(random_state=0)  
classifier.fit(x_train, y_train)  

In [None]:
#Predicting the test set result  
y_pred= classifier.predict(x_test)  

In [None]:
from sklearn.metrics import confusion_matrix  
cm= confusion_matrix(y_test,y_pred)  

In [None]:
print(cm)


In [None]:
import seaborn as sns

# fmt='d' prints the counts as whole numbers instead of scientific notation
ax = sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')

ax.set_title('Seaborn Confusion Matrix with labels\n\n');
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values ');

## Tick labels - list must follow the class order (0, then 1)
ax.xaxis.set_ticklabels(['False','True'])
ax.yaxis.set_ticklabels(['False','True'])

## Display the visualization of the Confusion Matrix.
plt.show()
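
The confusion matrix can also be summarized as a few numbers. The sketch below assumes the scikit-learn layout for a binary problem (rows are actual classes, columns are predicted classes, class 0 first). `summarize_confusion` is a hypothetical helper and the counts in `demo_cm` are made up, not results from this notebook.

```python
import numpy as np


def summarize_confusion(cm):
    """Accuracy, precision and recall from a 2x2 confusion matrix."""
    # Row-major order of a 2x2 matrix: [[tn, fp], [fn, tp]]
    tn, fp, fn, tp = np.asarray(cm).ravel()
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),  # share of predicted 1s that are correct
        "recall": tp / (tp + fn),     # share of actual 1s that are found
    }


# Made-up counts for illustration only.
demo_cm = [[50, 10], [5, 35]]
summary = summarize_confusion(demo_cm)
print(summary)
```

In the notebook, the same helper could be called as `summarize_confusion(cm)`.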

#### Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier() 
 
# Training the model on the training dataset
# fit function is used to train the model using the training sets as parameters
clf.fit(x_train, y_train)
 
# performing predictions on the test dataset
y_pred = clf.predict(x_test)
 
# metrics are used to find accuracy or error
from sklearn import metrics
 
# using metrics module for accuracy calculation
print("ACCURACY OF THE MODEL: ", metrics.accuracy_score(y_test, y_pred))

In [None]:
# print the scores on training and test set

print('Training set score: {:.4f}'.format(clf.score(x_train, y_train)))

print('Test set score: {:.4f}'.format(clf.score(x_test, y_test)))

#### LightGBM

In [None]:
import lightgbm as lgb
clf2 = lgb.LGBMClassifier()
clf2.fit(x_train, y_train)

In [None]:
# predict the results
y_pred=clf2.predict(x_test)

In [None]:
# view accuracy
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)  # arguments are (y_true, y_pred)
print('LightGBM Model accuracy score: {0:0.4f}'.format(accuracy))

In [None]:
# print the scores on training and test set

print('Training set score: {:.4f}'.format(clf2.score(x_train, y_train)))

print('Test set score: {:.4f}'.format(clf2.score(x_test, y_test)))
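
Accuracy is computed from hard 0/1 predictions, whereas ranking metrics such as ROC AUC use predicted probabilities. If the competition is scored on ROC AUC (an assumption worth checking on its evaluation page), it may be more informative to evaluate with `predict_proba`. The sketch below uses synthetic data and a random forest as a stand-in model. `LGBMClassifier` exposes the same `predict_proba` method.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the competition data).
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Column 1 of predict_proba holds the estimated probability of class 1.
proba = model.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, proba)
acc = accuracy_score(y_te, model.predict(X_te))
print(f"ROC AUC: {auc:.4f}  accuracy: {acc:.4f}")
```

In the notebook, the equivalent call would be `roc_auc_score(y_test, clf2.predict_proba(x_test)[:, 1])`.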

### Prediction

In [None]:
test_df.head()

In [None]:
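# Reuse the frequency map learned from the training data so that train and test share one encoding;
# values that never occur in the training data get a frequency of 0
test_df.f_27 = test_df.f_27.map(df_frequency_map).fillna(0)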
test_df.f_27.head()

In [None]:
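# Apply the scaler fitted on the training data, since clf2 was trained on scaled features
x_test2 = st_x.transform(test_df.iloc[:, 1:32].values)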
y_pred_test= clf2.predict(x_test2)  

In [None]:
y_pred_test.shape


In [None]:
submit = test_df[['id']].copy()  # copy to avoid pandas' SettingWithCopyWarning on the next line
submit['target'] = y_pred_test
submit.head()


In [None]:
submit.to_csv("Submit.csv", index=False)

In [None]:
s = pd.read_csv('/kaggle/working/Submit.csv')
s.head()