For this case study, you will perform a classification task on a WiFi dataset. You will use WLAN fingerprints to identify the location of a user. You will identify locations using the building numbers and floor numbers only. 

You will also explore the question: "Is more data useful for a classification task?"

The dataset you will use can be found on: https://archive.ics.uci.edu/ml/datasets/ujiindoorloc .

**\[Step 1\]** Once you examine the data sets, you will find that there is a training set and a validation set. However, you must also create a test set that has the same number of samples as the validation set. You can select and remove random samples from the training set and use them to create a test set. The test set should not be used in the training process or to optimize the parameters of any algorithm you use. The test set should only be used to report the final performance of a model whenever necessary.
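
The hold-out procedure described in Step 1 can be sketched as follows. This is a minimal illustration on toy data, not the required solution: the frame names `train_df`, `valid_df`, and `test_df` and the toy shapes are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real training and validation frames (assumed shapes only).
rng = np.random.default_rng(0)
columns = ["WAP001", "WAP002", "WAP003", "WAP004"]
train_df = pd.DataFrame(rng.integers(-100, 0, size=(50, 4)), columns=columns)
valid_df = pd.DataFrame(rng.integers(-100, 0, size=(10, 4)), columns=columns)

# Draw as many test rows as there are validation rows, without replacement.
test_df = train_df.sample(n=len(valid_df), random_state=42)

# Remove the sampled rows so the test set never overlaps the training set.
train_df = train_df.drop(index=test_df.index)

print(len(train_df), len(valid_df), len(test_df))
```

Deriving the sample size from `len(valid_df)` rather than hard-coding it keeps the two sets the same size even if the validation file changes.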

You may need to determine the features and labels of your model. You can also do some engineering on features and labels if necessary.
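
As one possible feature-engineering step, the sketch below recodes the value 100, which this dataset uses to mark an access point that was not detected, to a value below the weakest real reading. The replacement value `-105` and the helper name `recode_missing_signal` are assumptions chosen for illustration; whether this recoding helps depends on the model.

```python
import pandas as pd

NOT_DETECTED = 100   # value the dataset uses when a WAP was not detected
WEAK_SIGNAL = -105   # assumed replacement, just below the weakest real reading


def recode_missing_signal(df, wap_columns):
    """Return a copy of df in which 'not detected' readings look like very weak signals."""
    out = df.copy()
    out[wap_columns] = out[wap_columns].replace(NOT_DETECTED, WEAK_SIGNAL)
    return out


# Toy frame with two WAP columns and one label column.
toy = pd.DataFrame({
    "WAP001": [100, -97, 100],
    "WAP002": [-60, 100, -45],
    "FLOOR": [2, 0, 1],
})
wap_cols = [c for c in toy.columns if c.startswith("WAP")]
recoded = recode_missing_signal(toy, wap_cols)
print(recoded)
```

After recoding, larger values consistently mean stronger signals, which distance-based and linear models can use more directly than a +100 sentinel.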

**\[Step 2\]** Which algorithm should you use for your model? Refer to the scikit-learn cheat sheet (https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html) and try three algorithms. Some suggestions are LinearSVC, Logistic Regression, KNN classifier, SVC, and Random Forest (as an example of ensemble learning). Perform one experiment with each algorithm, observe the performance of each model, and note which model performs best.
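
A comparison of three algorithms might be organized as in the sketch below. It runs on synthetic data from `make_classification`, and the three models and their settings are arbitrary picks from the suggestions above, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data standing in for the WLAN fingerprints.
X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# One experiment per algorithm, all scored on the same validation split.
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
}

val_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    val_accuracy[name] = accuracy_score(y_val, model.predict(X_val))

best_name = max(val_accuracy, key=val_accuracy.get)
for name, acc in val_accuracy.items():
    print(f"{name}: {acc:.3f}")
print("Best on validation:", best_name)
```

Scoring every model on the same split keeps the comparison fair; the test set stays untouched until a final model has been chosen.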

**\[Step 3\]** Once the previous step is done, observe whether more data is useful for a classification task. For this, randomly select 20% of the training samples, but keep the size of the test set the same. You will not use the validation set in this step, as you will not optimize the model in any way. Note the performance. Then repeat with 40%, 60%, 80%, and 100% of the training samples. Perform three experiments for each selection: three experiments for 20%, three for 40%, and so on. Compute the average of the three experiments for each selection and plot the averages using a method of your choice.
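
The Step 3 protocol can be sketched as a nested loop over training fractions and repeats. The example below uses synthetic data and a KNN classifier purely as placeholders; the names `fractions`, `n_repeats`, and `mean_accuracy` are assumptions for the illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data; the test set is fixed once and reused for every experiment.
X, y = make_classification(n_samples=1200, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=200, random_state=0, stratify=y)

fractions = [0.2, 0.4, 0.6, 0.8, 1.0]
n_repeats = 3
rng = np.random.default_rng(0)

mean_accuracy = {}
for frac in fractions:
    n_rows = int(round(frac * len(X_train)))
    scores = []
    for _ in range(n_repeats):
        # Fresh random subset of the training rows for each repeat.
        rows = rng.choice(len(X_train), size=n_rows, replace=False)
        model = KNeighborsClassifier(n_neighbors=5)
        model.fit(X_train[rows], y_train[rows])
        scores.append(accuracy_score(y_test, model.predict(X_test)))
    mean_accuracy[frac] = float(np.mean(scores))

for frac, acc in mean_accuracy.items():
    print(f"{int(frac * 100)}% of training data: mean accuracy {acc:.3f}")
```

The averages can then be plotted, for example with `plt.plot(list(mean_accuracy), list(mean_accuracy.values()), marker="o")` after `import matplotlib.pyplot as plt`. At 100% every repeat draws the same rows, so the repeats differ only for models with internal randomness.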

**\[Step 4\]** Publish your findings in presentation slides. Like case study 1, three of you will be randomly chosen to present your work in front of the class. The slides should inform the audience about:

* the objective of the case study
* the data (features and labels)
* things you have done (e.g. why you selected a specific classification model)
* challenges you have faced that might be interesting to your classmates
* your findings


**Things to note**:

* **Type of task**: classification
* **Features**: you choose
* **Feature engineering**: You are welcome to do so.
* **Labels**: User locations. Use building and floor IDs, but ignore the SPACEID column.

* In some cases, normalization may result in reduced accuracy.
* You must write enough comments so that anybody with some programming knowledge can understand your code.

Also,
* This is not a group project. But if you think you will benefit from working with a partner, you are welcome to find a partner. No points will be deducted if you choose to do so. However, you must inform Himan (the TA) and me (Prof. Ghoshal) by **September 25, 2023** in that case.


**Grading Criteria**:

* [15 + 15] Data set preparation: Choosing your $X$ (features) and $y$ (label). Feature Engineering.
* [15 + 15 + 15] Three experiments using three algorithms.  
* [15] Observing the effects of more data using five sets of random samples of different sizes from the training set. 
* [10] Presentation slides and presentation.

**What to submit**:

Put the Jupyter Notebook file and the .csv file in a folder. Then convert your presentation slides to a PDF file and put it in the same folder. Zip the folder. After zipping, it should have the extension .zip. The name of the .zip file should be firstname_lastname_casestudy_2.zip . Upload the .zip file on Canvas.

In [374]:
# start here
# create as many cells as needed

In [375]:
import numpy as np
import pandas as pd
import time
import pprint
import matplotlib.pyplot as mp
import seaborn as sb

In [376]:
Training_set = pd.read_csv("C:/Users/user/Downloads/Fall - 23/Introduction to Data Science/Assignments/CS - 02/UJIndoorLoc/trainingData.csv")
Training_set.head()

Unnamed: 0,WAP001,WAP002,WAP003,WAP004,WAP005,WAP006,WAP007,WAP008,WAP009,WAP010,...,WAP520,LONGITUDE,LATITUDE,FLOOR,BUILDINGID,SPACEID,RELATIVEPOSITION,USERID,PHONEID,TIMESTAMP
0,100,100,100,100,100,100,100,100,100,100,...,100,-7541.2643,4864921.0,2,1,106,2,2,23,1371713733
1,100,100,100,100,100,100,100,100,100,100,...,100,-7536.6212,4864934.0,2,1,106,2,2,23,1371713691
2,100,100,100,100,100,100,100,-97,100,100,...,100,-7519.1524,4864950.0,2,1,103,2,2,23,1371714095
3,100,100,100,100,100,100,100,100,100,100,...,100,-7524.5704,4864934.0,2,1,102,2,2,23,1371713807
4,100,100,100,100,100,100,100,100,100,100,...,100,-7632.1436,4864982.0,0,0,122,2,11,13,1369909710


In [377]:
Training_set.describe()

Unnamed: 0,WAP001,WAP002,WAP003,WAP004,WAP005,WAP006,WAP007,WAP008,WAP009,WAP010,...,WAP520,LONGITUDE,LATITUDE,FLOOR,BUILDINGID,SPACEID,RELATIVEPOSITION,USERID,PHONEID,TIMESTAMP
count,19937.0,19937.0,19937.0,19937.0,19937.0,19937.0,19937.0,19937.0,19937.0,19937.0,...,19937.0,19937.0,19937.0,19937.0,19937.0,19937.0,19937.0,19937.0,19937.0,19937.0
mean,99.823644,99.820936,100.0,100.0,99.613733,97.130461,94.733661,93.820234,94.693936,99.163766,...,100.0,-7464.275947,4864871.0,1.674575,1.21282,148.429954,1.833024,9.068014,13.021869,1371421000.0
std,5.866842,5.798156,0.0,0.0,8.615657,22.93189,30.541335,33.010404,30.305084,12.634045,...,0.0,123.40201,66.93318,1.223078,0.833139,58.342106,0.372964,4.98872,5.36241,557205.4
min,-97.0,-90.0,100.0,100.0,-97.0,-98.0,-99.0,-98.0,-98.0,-99.0,...,100.0,-7691.3384,4864746.0,0.0,0.0,1.0,1.0,1.0,1.0,1369909000.0
25%,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,...,100.0,-7594.737,4864821.0,1.0,0.0,110.0,2.0,5.0,8.0,1371056000.0
50%,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,...,100.0,-7423.0609,4864852.0,2.0,1.0,129.0,2.0,11.0,13.0,1371716000.0
75%,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,...,100.0,-7359.193,4864930.0,3.0,2.0,207.0,2.0,13.0,14.0,1371721000.0
max,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,...,100.0,-7300.81899,4865017.0,4.0,2.0,254.0,2.0,18.0,24.0,1371738000.0


In [378]:
Training_set.corr()

Unnamed: 0,WAP001,WAP002,WAP003,WAP004,WAP005,WAP006,WAP007,WAP008,WAP009,WAP010,...,WAP520,LONGITUDE,LATITUDE,FLOOR,BUILDINGID,SPACEID,RELATIVEPOSITION,USERID,PHONEID,TIMESTAMP
WAP001,1.000000,-0.000928,,,-0.001348,-0.003762,-0.005184,0.004170,-0.005263,-0.001990,...,,0.035730,-0.054910,-0.025719,0.043761,0.016777,-0.013458,-0.011642,0.000123,0.063228
WAP002,-0.000928,1.000000,,,-0.001385,-0.003865,-0.005326,-0.005782,-0.005408,-0.002044,...,,0.050326,-0.021718,-0.021374,0.044959,-0.035616,-0.013827,0.049948,-0.005633,0.020383
WAP003,,,,,,,,,,,...,,,,,,,,,,
WAP004,,,,,,,,,,,...,,,,,,,,,,
WAP005,-0.001348,-0.001385,,,1.000000,-0.005610,-0.007731,-0.008393,-0.007850,-0.002968,...,,-0.054699,0.048553,-0.042142,-0.042362,0.008474,0.043336,-0.023651,0.020221,-0.024054
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
SPACEID,0.016777,-0.035616,,,0.008474,-0.146094,-0.199881,0.129147,-0.209836,0.131951,...,,-0.102063,-0.155612,-0.037205,-0.129869,1.000000,0.042184,-0.203930,-0.012862,-0.179500
RELATIVEPOSITION,-0.013458,-0.013827,,,0.043336,0.084678,-0.062160,0.073497,-0.073372,0.011871,...,,-0.151616,0.120937,0.161936,-0.149405,0.042184,1.000000,-0.113595,0.034807,-0.227320
USERID,-0.011642,0.049948,,,-0.023651,0.037283,-0.066782,-0.034706,0.131440,-0.002092,...,,0.347764,-0.227728,-0.185551,0.338069,-0.203930,-0.113595,1.000000,-0.116192,0.130628
PHONEID,0.000123,-0.005633,,,0.020221,0.134668,0.000703,0.007552,-0.017822,-0.068594,...,,-0.072975,-0.034537,0.167536,-0.038654,-0.012862,0.034807,-0.116192,1.000000,-0.029279
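
The empty rows for WAP003 and WAP004 in the correlation table arise because those columns are constant (their standard deviation in `describe()` is 0.0), so the Pearson correlation is undefined. A minimal sketch of detecting and dropping such columns, on a toy frame with assumed values, could look like this:

```python
import pandas as pd

# Toy frame: WAP003 is constant, mimicking an access point never detected in training.
toy = pd.DataFrame({
    "WAP001": [100, -97, 100, 100],
    "WAP003": [100, 100, 100, 100],
    "WAP005": [-80, 100, -75, 100],
})

# A column with a single distinct value carries no information for a classifier.
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]
reduced = toy.drop(columns=constant_cols)

print("Constant columns:", constant_cols)
print("Remaining columns:", list(reduced.columns))
```

If columns are dropped this way, the same column list should be removed from the validation and test frames so that all three keep identical features.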


In [379]:
Training_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19937 entries, 0 to 19936
Columns: 529 entries, WAP001 to TIMESTAMP
dtypes: float64(2), int64(527)
memory usage: 80.5 MB


In [380]:
Training_set.isnull().all(axis=0)

WAP001              False
WAP002              False
WAP003              False
WAP004              False
WAP005              False
                    ...  
SPACEID             False
RELATIVEPOSITION    False
USERID              False
PHONEID             False
TIMESTAMP           False
Length: 529, dtype: bool


In [382]:
Training_set['Location'] = Training_set['BUILDINGID'].astype(str) + '-' + Training_set['FLOOR'].astype(str)

Training_set.head()

Unnamed: 0,WAP001,WAP002,WAP003,WAP004,WAP005,WAP006,WAP007,WAP008,WAP009,WAP010,...,LONGITUDE,LATITUDE,FLOOR,BUILDINGID,SPACEID,RELATIVEPOSITION,USERID,PHONEID,TIMESTAMP,Location
0,100,100,100,100,100,100,100,100,100,100,...,-7541.2643,4864921.0,2,1,106,2,2,23,1371713733,1-2
1,100,100,100,100,100,100,100,100,100,100,...,-7536.6212,4864934.0,2,1,106,2,2,23,1371713691,1-2
2,100,100,100,100,100,100,100,-97,100,100,...,-7519.1524,4864950.0,2,1,103,2,2,23,1371714095,1-2
3,100,100,100,100,100,100,100,100,100,100,...,-7524.5704,4864934.0,2,1,102,2,2,23,1371713807,1-2
4,100,100,100,100,100,100,100,100,100,100,...,-7632.1436,4864982.0,0,0,122,2,11,13,1369909710,0-0


In [383]:
# Drop the raw label columns (now merged into 'Location') and the metadata columns not used as features
Training_set.drop(columns=['BUILDINGID','FLOOR','SPACEID','USERID','RELATIVEPOSITION','PHONEID','TIMESTAMP'], inplace=True)

In [384]:
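# Hold out 1111 random rows as the test set (the assignment asks for the same size as the validation set)
Test_data = Training_set.sample(n=1111, random_state=42)

# Drop the selected rows from the training dataset so the training and test sets do not overlap
Training_data = Training_set.drop(Test_data.index)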

In [385]:
Training_data.head()

Unnamed: 0,WAP001,WAP002,WAP003,WAP004,WAP005,WAP006,WAP007,WAP008,WAP009,WAP010,...,WAP514,WAP515,WAP516,WAP517,WAP518,WAP519,WAP520,LONGITUDE,LATITUDE,Location
0,100,100,100,100,100,100,100,100,100,100,...,100,100,100,100,100,100,100,-7541.2643,4864921.0,1-2
1,100,100,100,100,100,100,100,100,100,100,...,100,100,100,100,100,100,100,-7536.6212,4864934.0,1-2
2,100,100,100,100,100,100,100,-97,100,100,...,100,100,100,100,100,100,100,-7519.1524,4864950.0,1-2
3,100,100,100,100,100,100,100,100,100,100,...,100,100,100,100,100,100,100,-7524.5704,4864934.0,1-2
4,100,100,100,100,100,100,100,100,100,100,...,100,100,100,100,100,100,100,-7632.1436,4864982.0,0-0


In [386]:
Test_data.head()

Unnamed: 0,WAP001,WAP002,WAP003,WAP004,WAP005,WAP006,WAP007,WAP008,WAP009,WAP010,...,WAP514,WAP515,WAP516,WAP517,WAP518,WAP519,WAP520,LONGITUDE,LATITUDE,Location
10958,100,100,100,100,100,100,100,100,-96,100,...,100,100,100,100,100,100,100,-7646.7758,4864926.0,0-3
12425,100,100,100,100,100,100,100,100,100,100,...,100,100,100,100,100,100,100,-7474.5537,4864867.0,1-1
322,100,100,100,100,100,100,100,100,100,100,...,100,100,100,100,100,100,100,-7349.2796,4864759.0,2-3
2393,100,100,100,100,100,100,100,100,100,100,...,100,100,100,100,100,100,100,-7369.4144,4864768.0,2-3
5343,100,100,100,100,100,100,100,100,100,100,...,100,100,100,100,100,100,100,-7414.87347,4864881.0,1-2


In [389]:
Validation_data = pd.read_csv("C:/Users/user/Downloads/Fall - 23/Introduction to Data Science/Assignments/CS - 02/UJIndoorLoc/validationData.csv")
Validation_data.head()

Unnamed: 0,WAP001,WAP002,WAP003,WAP004,WAP005,WAP006,WAP007,WAP008,WAP009,WAP010,...,WAP520,LONGITUDE,LATITUDE,FLOOR,BUILDINGID,SPACEID,RELATIVEPOSITION,USERID,PHONEID,TIMESTAMP
0,100,100,100,100,100,100,100,100,100,100,...,100,-7515.916799,4864890.0,1,1,0,0,0,0,1380872703
1,100,100,100,100,100,100,100,100,100,100,...,100,-7383.867221,4864840.0,4,2,0,0,0,13,1381155054
2,100,100,100,100,100,100,100,100,100,100,...,100,-7374.30208,4864847.0,4,2,0,0,0,13,1381155095
3,100,100,100,100,100,100,100,100,100,100,...,100,-7365.824883,4864843.0,4,2,0,0,0,13,1381155138
4,100,100,100,100,100,100,100,100,100,100,...,100,-7641.499303,4864922.0,2,0,0,0,0,2,1380877774


In [390]:
Validation_data['Location'] = Validation_data['BUILDINGID'].astype(str) + '-' + Validation_data['FLOOR'].astype(str)

Validation_data.head()

Unnamed: 0,WAP001,WAP002,WAP003,WAP004,WAP005,WAP006,WAP007,WAP008,WAP009,WAP010,...,LONGITUDE,LATITUDE,FLOOR,BUILDINGID,SPACEID,RELATIVEPOSITION,USERID,PHONEID,TIMESTAMP,Location
0,100,100,100,100,100,100,100,100,100,100,...,-7515.916799,4864890.0,1,1,0,0,0,0,1380872703,1-1
1,100,100,100,100,100,100,100,100,100,100,...,-7383.867221,4864840.0,4,2,0,0,0,13,1381155054,2-4
2,100,100,100,100,100,100,100,100,100,100,...,-7374.30208,4864847.0,4,2,0,0,0,13,1381155095,2-4
3,100,100,100,100,100,100,100,100,100,100,...,-7365.824883,4864843.0,4,2,0,0,0,13,1381155138,2-4
4,100,100,100,100,100,100,100,100,100,100,...,-7641.499303,4864922.0,2,0,0,0,0,2,1380877774,0-2


In [391]:
# Apply the same column removal as for the training set so both frames share identical features
Validation_data.drop(columns=['BUILDINGID','FLOOR','SPACEID','USERID','RELATIVEPOSITION','PHONEID','TIMESTAMP'], inplace=True)

In [392]:
Validation_data.head()

Unnamed: 0,WAP001,WAP002,WAP003,WAP004,WAP005,WAP006,WAP007,WAP008,WAP009,WAP010,...,WAP514,WAP515,WAP516,WAP517,WAP518,WAP519,WAP520,LONGITUDE,LATITUDE,Location
0,100,100,100,100,100,100,100,100,100,100,...,100,100,100,100,100,100,100,-7515.916799,4864890.0,1-1
1,100,100,100,100,100,100,100,100,100,100,...,100,100,100,100,100,100,100,-7383.867221,4864840.0,2-4
2,100,100,100,100,100,100,100,100,100,100,...,100,100,100,100,100,100,100,-7374.30208,4864847.0,2-4
3,100,100,100,100,100,100,100,100,100,100,...,100,100,100,100,100,100,100,-7365.824883,4864843.0,2-4
4,100,100,100,100,100,100,100,100,100,100,...,100,100,100,100,100,100,100,-7641.499303,4864922.0,0-2


In [393]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

# Step 1: Choose features and labels
# Features (X) are every column except 'Location'; the label (y) is the combined building-floor string.
X_train = Training_data.drop(columns=['Location'])
y_train = Training_data['Location']  # 1-D Series, which avoids scikit-learn's column-vector warning

# The validation set is used here to estimate generalization accuracy
X_test = Validation_data.drop(columns=['Location'])
y_test = Validation_data['Location']

# Create a logistic regression model
logistic_model = LogisticRegression()

# Hyperparameter tuning with GridSearchCV
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100]}  # Candidate inverse regularization strengths
grid_search = GridSearchCV(logistic_model, param_grid, cv=5)  # Perform 5-fold cross-validation

# Train the logistic regression model on the training data with hyperparameter tuning
grid_search.fit(X_train, y_train)

# Get the best estimator
best_estimator = grid_search.best_estimator_

# Make predictions on the training data
y_train_pred = best_estimator.predict(X_train)
# Make predictions on the validation data
y_pred = best_estimator.predict(X_test)

# Calculate the training accuracy for Location
accuracy_train_building_floor = accuracy_score(y_train, y_train_pred)

# Calculate the validation accuracy for Location
accuracy_building_floor = accuracy_score(y_test, y_pred)

print(f'Training Accuracy for Location: {accuracy_train_building_floor * 100:.4f}%')

print(f'Validation Accuracy for Location: {accuracy_building_floor * 100:.4f}%')


  y = column_or_1d(y, warn=True)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Training Accuracy for Location: 82.1470%
Validation Accuracy for Location: 72.6373%
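
The convergence warning suggests either raising `max_iter` or scaling the data. A minimal sketch of both, on synthetic data with a deliberately inflated feature scale, is shown below; the pipeline layout and `max_iter=1000` are assumptions for illustration, and scaling does not always improve accuracy on real data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data; multiplying by 1000 mimics badly scaled features.
X, y = make_classification(n_samples=500, n_features=30, n_informative=10,
                           n_classes=3, random_state=0)
X = X * 1000.0

# Standardize inside the pipeline so the scaler is fitted only on the data passed to fit().
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X, y)

# n_iter_ reports how many solver iterations were actually used.
n_iter = scaled_logreg.named_steps["logisticregression"].n_iter_
train_score = scaled_logreg.score(X, y)
print("Iterations used:", n_iter.max())
print(f"Training accuracy: {train_score:.3f}")
```

Because the scaler lives inside the pipeline, passing the pipeline to `GridSearchCV` refits the scaling within each fold instead of leaking statistics from the held-out fold.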




In [394]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

# Features are every column except 'Location'; the label is the combined building-floor string
X_train = Training_data.drop(columns=['Location'])
y_train = Training_data['Location']  # 1-D Series, as scikit-learn expects

# The validation set is used here to estimate generalization accuracy
X_test = Validation_data.drop(columns=['Location'])
y_test = Validation_data['Location']

# Create a Decision Tree model
decision_tree_model = DecisionTreeClassifier()

# Hyperparameter tuning with GridSearchCV
param_grid = {'max_depth': [None, 10, 20, 30, 40]}  # You can adjust the list of max_depth values
grid_search = GridSearchCV(decision_tree_model, param_grid, cv=5)  # Perform 5-fold cross-validation

# Train the decision tree model on the training data with hyperparameter tuning
grid_search.fit(X_train, y_train)

# Get the best estimator
best_estimator = grid_search.best_estimator_

# Make predictions on the training data
y_train_pred = best_estimator.predict(X_train)

# Make predictions on the test data
y_pred = best_estimator.predict(X_test)

# Calculate the training accuracy for Location
accuracy_train_building = accuracy_score(y_train, y_train_pred)

# Calculate the accuracy for Location
accuracy_building = accuracy_score(y_test, y_pred)

print(f'Training Accuracy for Location: {accuracy_train_building * 100:.4f}%')

print(f'Validation Accuracy for Location: {accuracy_building * 100:.4f}%')


Training Accuracy for Location: 99.9469%
Validation Accuracy for Location: 73.5374%
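
The gap between training and validation accuracy above points to overfitting. One way to use the three data splits as Step 1 describes is to choose `max_depth` on the validation set and report the test set only once at the end. The sketch below does this on synthetic data; the depth candidates and split sizes are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data split into train / validation / test.
X, y = make_classification(n_samples=1500, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=300, random_state=0, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=300, random_state=0, stratify=y_rest)

depth_candidates = [2, 4, 8, 16, None]
best_depth, best_val_acc, best_model = None, -1.0, None
for depth in depth_candidates:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    val_acc = accuracy_score(y_val, model.predict(X_val))
    # Keep the depth that generalizes best to the validation set.
    if val_acc > best_val_acc:
        best_depth, best_val_acc, best_model = depth, val_acc, model

# The test set is touched exactly once, after model selection is finished.
test_acc = accuracy_score(y_test, best_model.predict(X_test))
print("Selected max_depth:", best_depth)
print(f"Validation accuracy: {best_val_acc:.3f}, test accuracy: {test_acc:.3f}")
```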


In [395]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, make_scorer, f1_score
from sklearn.model_selection import cross_val_score, GridSearchCV

# This cell reuses X_train and X_test from the previous cell. It also expects y_train_building and
# y_test_building (BUILDINGID labels aligned with those rows) to exist in the session already.

# Create a Gradient Boosting model
gradient_boosting_model = GradientBoostingClassifier()

# Define a custom scoring function for F1-score
scorer = make_scorer(f1_score, average='weighted')

# Perform 5-fold cross-validation (you can change the number of folds as needed)
# Use the custom scoring function
scores = cross_val_score(gradient_boosting_model, X_train, y_train_building, cv=5, scoring=scorer)

# Print the cross-validation scores
print("Cross-Validation Scores:", scores)
print("Mean F1-Score:", scores.mean())

# Define the hyperparameters and their possible values for tuning
param_grid = {
    'n_estimators': [50, 100, 150],  # Number of boosting stages to be used
    'max_depth': [3, 4, 5],         # Maximum depth of individual trees
    'learning_rate': [0.1, 0.2, 0.3]  # Step size shrinking to prevent overfitting
}

# Create a GridSearchCV object
grid_search = GridSearchCV(gradient_boosting_model, param_grid, cv=5, scoring=scorer)

# Fit the GridSearchCV on the training data to find the best hyperparameters
grid_search.fit(X_train, y_train_building)

# Get the best estimator with tuned hyperparameters
best_estimator = grid_search.best_estimator_

# Make predictions on the training data
y_train_pred = best_estimator.predict(X_train)

# Make predictions on the test data
y_pred = best_estimator.predict(X_test)

# Calculate the training accuracy for BUILDINGID
accuracy_train_building = accuracy_score(y_train_building, y_train_pred)

# Calculate the accuracy for BUILDINGID
accuracy_building = accuracy_score(y_test_building, y_pred)

print(f'Training Accuracy for BUILDINGID: {accuracy_train_building * 100:.4f}%')
print(f'Validation Accuracy for BUILDINGID: {accuracy_building * 100:.2f}%')


Cross-Validation Scores: [0.85680966 0.98804924 0.94314527 0.98427087 0.94314287]
Mean F1-Score: 0.9430835845444385


KeyboardInterrupt: 
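
The grid above has 3 × 3 × 3 = 27 parameter combinations, so 5-fold cross-validation means 135 gradient-boosting fits plus a final refit, which is likely why the run was interrupted. A cheaper alternative, sketched here on small synthetic data, samples only a few combinations with `RandomizedSearchCV`; the values `n_iter=4` and `cv=3` are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic 3-class problem so the example finishes quickly.
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# Same search space as the grid above, but only a few points are sampled from it.
param_distributions = {
    "n_estimators": [50, 100, 150],
    "max_depth": [3, 4, 5],
    "learning_rate": [0.1, 0.2, 0.3],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions,
    n_iter=4,               # 4 sampled combinations instead of all 27
    cv=3,                   # 3 folds instead of 5
    scoring="f1_weighted",  # built-in equivalent of the weighted F1 scorer
    random_state=0,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best mean CV F1: {search.best_score_:.3f}")
```

This runs 12 cross-validation fits instead of 135, at the cost of possibly missing the best combination in the grid.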