
Feature-engineering

DATA DESCRIPTION: sensor-data.csv, shape (1567, 592).

  • The data consists of 1567 examples, each with 591 features.
  • Each example represents a single production entity with its associated measured features; the labels represent a simple pass/fail yield for in-house line testing.
  • In the target column, -1 corresponds to a pass and 1 corresponds to a fail; each row's timestamp marks that specific test point.

Steps and tasks:

  1. Import and explore the data.
  2. Data cleansing:

• Missing value treatment.
• Drop attributes if required, using relevant functional knowledge.
• Make all relevant modifications to the data using functional/logical reasoning and clearly stated assumptions (see the sketch below).
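A minimal sketch of such a cleansing pass, assuming pandas; the 50% missing-value cut-off and the median-imputation strategy are illustrative assumptions, not choices taken from this repository's notebook:

```python
import pandas as pd

# Load the sensor data described above.
df = pd.read_csv("sensor-data.csv")

# Drop columns that are mostly missing (the 50% cut-off is an assumption).
missing_frac = df.isna().mean()
df = df.drop(columns=missing_frac[missing_frac > 0.5].index)

# Drop constant columns; they carry no information for a classifier.
df = df.loc[:, df.nunique(dropna=False) > 1]

# Impute the remaining gaps with each column's median (an assumed strategy).
df = df.fillna(df.median(numeric_only=True))
```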

  3. Data analysis & visualisation:

• Perform detailed, relevant statistical analysis on the data.
• Perform detailed univariate, bivariate and multivariate analyses, with appropriate comments after each analysis.
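A hedged example of the univariate and multivariate steps, continuing from the cleansed `df` above; the `Pass/Fail` target column name and the 20-column subset are assumptions made only for illustration:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Univariate: summary statistics and the target's class distribution.
print(df.describe())
df["Pass/Fail"].value_counts().plot(kind="bar")  # assumed target column name
plt.show()

# Bivariate/multivariate: correlation heatmap over a small feature subset
# (the first 20 columns, chosen only to keep the plot readable).
sns.heatmap(df.iloc[:, :20].corr(), cmap="coolwarm", center=0)
plt.show()
```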

  4. Data pre-processing:

• Segregate predictors vs. target attributes.
• Check for target balance and fix it if found imbalanced.
• Perform the train-test split and standardise the data, or vice versa if required.
• Check that the train and test sets have statistical characteristics similar to the original data.
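A sketch of the split-then-standardise order, assuming scikit-learn; the split ratio, random seed and the `Pass/Fail` column name are illustrative:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Segregate predictors vs. target.
X = df.drop(columns=["Pass/Fail"])
y = df["Pass/Fail"]

# Check target balance before choosing a resampling strategy.
print(y.value_counts(normalize=True))

# Split first, then fit the scaler on the train set only, so no
# test-set statistics leak into training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```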

  5. Model training, testing and tuning:

• Pick a supervised learning model.
• Train the model.
• Use cross-validation techniques.
• Apply hyperparameter tuning techniques to get the best accuracy. Suggestion: try all feasible hyperparameter combinations to extract the best accuracies.
• Use any other technique/method that can enhance model performance.
• Display and explain the classification report in detail.
• Design a method of your own to check whether the achieved train and test accuracies would change with a different sample population.
• Apply the above steps to all the models you have learnt so far.
• Display and compare all the models designed, with their train and test accuracies.
• Select the final best-trained model, with detailed comments on why it was selected.
• Pickle the selected model for future use.
• Import the future data file and use it to perform prediction with the best chosen model from above; display the prediction results. A tuning-and-pickling sketch follows this list.
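A sketch of the tuning and pickling steps for the Logistic Regression case, assuming scikit-learn's GridSearchCV; the parameter grid shown is illustrative, not the grid used in this repository:

```python
import pickle
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

# Grid search with 5-fold cross-validation over an assumed parameter grid.
param_grid = {"C": [0.01, 0.1, 1, 10]}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid.fit(X_train, y_train)

# Inspect the best hyperparameters and the detailed classification report.
best_model = grid.best_estimator_
print(grid.best_params_)
print(classification_report(y_test, best_model.predict(X_test)))

# Pickle the selected model for future use.
with open("best_model.pkl", "wb") as f:
    pickle.dump(best_model, f)
```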
  6. Conclusion and improvisation:

• PCA was implemented to reduce the features from 478 to 200 components, which explained 98% of the variance in the data.
• To balance the samples across class labels, oversampling (via SMOTE) and undersampling (via RandomUnderSampler) were performed.
• Combined oversampling and undersampling was performed via SMOTETomek.
• Oversampling via SMOTE gave the best results: 80.57% test accuracy for the Logistic Regression model and 95.54% for the Support Vector Classifier.
• GridSearchCV was also used to find the best hyperparameters for Logistic Regression. A condensed PCA-plus-resampling sketch follows below.
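A condensed sketch of the PCA-plus-resampling pipeline summarised above, assuming scikit-learn and imbalanced-learn; applying resampling to the training set only is standard practice rather than something stated in this README:

```python
from sklearn.decomposition import PCA
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTETomek

# Reduce to 200 components; per the conclusion above, this retained ~98%
# of the variance.
pca = PCA(n_components=200)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
print(pca.explained_variance_ratio_.sum())

# The three resampling strategies compared in this repository, applied
# to the training set only.
X_sm, y_sm = SMOTE(random_state=42).fit_resample(X_train_pca, y_train)
X_ru, y_ru = RandomUnderSampler(random_state=42).fit_resample(X_train_pca, y_train)
X_st, y_st = SMOTETomek(random_state=42).fit_resample(X_train_pca, y_train)
```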

About

This repository contains the code for implementations of Principal Component Analysis, upsampling (SMOTE), downsampling (RandomUnderSampler), and a combined approach via SMOTETomek.
