### Counterfeit Medicines Sales Prediction

Counterfeit medicines are fake medicines that are either contaminated or contain the wrong active ingredient or none at all. They may also contain the right active ingredient at the wrong dose. Counterfeit drugs are illegal and harmful to health. About 10% of the world's medicine is counterfeit, and the problem is even worse in developing countries, where up to 30% of medicines are counterfeit.

Millions of pills, bottles and sachets of counterfeit and illegal medicines are traded across the world. The World Health Organization (WHO) is working with the International Criminal Police Organization (Interpol) to dislodge the criminal networks raking in billions of dollars from this cynical trade.

Despite all these efforts, counterfeit medicine rackets keep popping up here and there. It has become a challenge to deploy resources against them without spreading those resources too thin and rendering them ineffective. The government has decided to focus on illegal operations of high net worth first instead of trying to control all of them. To do that, it has collected data that will help **predict sales figures given an illegal operation's characteristics.**

#### Observations: 
This is a **supervised ML regression problem.**

In [1]:
# importing the libraries
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np 

In [2]:
# reading the data files
datafile_train="counterfeit_train.csv"
datafile_test="counterfeit_test.csv"
bd_train=pd.read_csv(datafile_train)
bd_test=pd.read_csv(datafile_test) 

In [3]:
# size of the data
bd_train.shape, bd_test.shape 

((6818, 12), (1705, 11))

In [4]:
# data information
bd_train.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6818 entries, 0 to 6817
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Medicine_ID          6818 non-null   object 
 1   Counterfeit_Weight   5652 non-null   float64
 2   DistArea_ID          6818 non-null   object 
 3   Active_Since         6818 non-null   int64  
 4   Medicine_MRP         6818 non-null   float64
 5   Medicine_Type        6818 non-null   object 
 6   SidEffect_Level      6818 non-null   object 
 7   Availability_rating  6818 non-null   float64
 8   Area_Type            6818 non-null   object 
 9   Area_City_Type       6818 non-null   object 
 10  Area_dist_level      6818 non-null   object 
 11  Counterfeit_Sales    6818 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 639.3+ KB


In [7]:
bd_test.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1705 entries, 0 to 1704
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Medicine_ID          1705 non-null   object 
 1   Counterfeit_Weight   1408 non-null   float64
 2   DistArea_ID          1705 non-null   object 
 3   Active_Since         1705 non-null   int64  
 4   Medicine_MRP         1705 non-null   float64
 5   Medicine_Type        1705 non-null   object 
 6   SidEffect_Level      1705 non-null   object 
 7   Availability_rating  1705 non-null   float64
 8   Area_Type            1705 non-null   object 
 9   Area_City_Type       1705 non-null   object 
 10  Area_dist_level      1705 non-null   object 
dtypes: float64(3), int64(1), object(7)
memory usage: 146.7+ KB


- 'Active_Since' is stored as numeric data but is a year. It can be treated as categorical **(or)** converted into a numeric age: age = (present_year - established_year).
- A large number of missing values are present in the 'Counterfeit_Weight' column (1,166 of 6,818 rows in train and 297 of 1,705 rows in test).
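To put the missing-value observation in proportion, the short sketch below recomputes the missing share from the counts printed by `.info()`. The helper name `missing_share` is made up for this illustration; on the real data the same figure could be obtained with `bd_train['Counterfeit_Weight'].isnull().mean()`.

```python
# Counts copied from the .info() outputs above
train_rows, train_non_null = 6818, 5652
test_rows, test_non_null = 1705, 1408


def missing_share(total, non_null):
    """Fraction of rows whose value is missing (hypothetical helper)."""
    return (total - non_null) / total


train_missing = missing_share(train_rows, train_non_null)
test_missing = missing_share(test_rows, test_non_null)

print(f"train: {train_rows - train_non_null} missing ({train_missing:.1%})")
print(f"test:  {test_rows - test_non_null} missing ({test_missing:.1%})")
```

Roughly 17% of the weights are missing in both files, which is too many rows to simply drop.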

In [8]:
# train data
bd_train.head() 

| | Medicine_ID | Counterfeit_Weight | DistArea_ID | Active_Since | Medicine_MRP | Medicine_Type | SidEffect_Level | Availability_rating | Area_Type | Area_City_Type | Area_dist_level | Counterfeit_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | RRA15 | 13.1 | Area046 | 1995 | 160.2366 | Antimalarial | critical | 0.070422 | DownTown | Tier 1 | Small | 1775.5026 |
| 1 | YVV26 | NaN | Area027 | 1983 | 110.4384 | Mstablizers | mild | 0.013 | CityLimits | Tier 3 | Medium | 3069.152 |
| 2 | LJC15 | 9.025 | Area046 | 1995 | 259.4092 | Cardiac | mild | 0.060783 | DownTown | Tier 1 | Small | 2603.092 |
| 3 | GWC40 | 11.8 | Area046 | 1995 | 99.983 | OralContraceptives | mild | 0.065555 | DownTown | Tier 1 | Small | 1101.713 |
| 4 | QMN13 | NaN | Area019 | 1983 | 56.4402 | Hreplacements | critical | 0.248859 | MidTownResidential | Tier 1 | Small | 158.9402 |


In [9]:
# categorical features/variables in the data
bd_train.select_dtypes('O').columns

Index(['Medicine_ID', 'DistArea_ID', 'Medicine_Type', 'SidEffect_Level',
       'Area_Type', 'Area_City_Type', 'Area_dist_level'],
      dtype='object')

In [10]:
# no of unique values in each categorical column/ feature
bd_train.select_dtypes('O').nunique() 

Medicine_ID        1557
DistArea_ID          10
Medicine_Type        16
SidEffect_Level       2
Area_Type             4
Area_City_Type        3
Area_dist_level       4
dtype: int64

"Medicine_ID" has by far the most unique values (1,557). Creating dummy columns for it would be computationally expensive, and it is only an identifier for the medicine, so we drop the column.

In [11]:
# dummies will be created for ['Medicine_Type','SidEffect_Level','Area_Type','Area_City_Type','Area_dist_level','DistArea_ID']
# but not for 'Medicine_ID', because it has too many categories
bd_train.select_dtypes('O').nunique() 

Medicine_ID        1557
DistArea_ID          10
Medicine_Type        16
SidEffect_Level       2
Area_Type             4
Area_City_Type        3
Area_dist_level       4
dtype: int64

In [12]:
# creating dummies for categorical features
for col in ['Medicine_Type','SidEffect_Level','Area_Type','Area_City_Type','Area_dist_level',"DistArea_ID"]:  
    # create dummies for each column, drop the first dummy column, and convert the result to integers (0 or 1)
    temp=pd.get_dummies(bd_train[col],prefix=col,drop_first=True).astype('int')
    # adding the dummy columns to main data
    bd_train=pd.concat([temp,bd_train],axis=1)
    # deleting the column after creating dummies 
    bd_train.drop([col],axis=1,inplace=True) 
    
    temp=pd.get_dummies(bd_test[col],prefix=col,drop_first=True).astype('int')
    bd_test=pd.concat([temp,bd_test],axis=1)
    bd_test.drop([col],axis=1,inplace=True)    

# shape of the data after creating dummies
bd_train.shape,bd_test.shape 

((6818, 39), (1705, 38))
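The shapes above suggest that train and test ended up with matching dummy columns (train has one extra column, the target). That is not guaranteed when each frame is encoded independently: if a category is absent from the test file, `drop_first=True` drops a different category there. A minimal sketch of one way to guard against this, using toy data and a hypothetical helper `encode` that fixes the category list from the training data:

```python
import pandas as pd

# Toy data: "CityLimits" never appears in the test frame
train = pd.DataFrame({"Area_Type": ["CityLimits", "DownTown", "Industrial", "DownTown"]})
test = pd.DataFrame({"Area_Type": ["DownTown", "Industrial"]})


def encode(df, col, categories):
    """One-hot encode `col` against a fixed category list (hypothetical helper)."""
    as_cat = pd.Series(pd.Categorical(df[col], categories=categories), index=df.index)
    return pd.get_dummies(as_cat, prefix=col, drop_first=True).astype(int)


# The category list is taken from the training data only
categories = sorted(train["Area_Type"].unique())

train_dummies = encode(train, "Area_Type", categories)
test_dummies = encode(test, "Area_Type", categories)

# Encoding the test frame on its own drops "DownTown" instead of "CityLimits"
naive_test = pd.get_dummies(test["Area_Type"], prefix="Area_Type", drop_first=True)

print(list(train_dummies.columns))
print(list(test_dummies.columns))
print(list(naive_test.columns))
```

With the fixed category list, both frames share the same columns in the same order; the naive encoding of the toy test frame yields a single column with a different meaning.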

In [13]:
# modified data 
bd_train.head() 

Unnamed: 0,DistArea_ID_Area013,DistArea_ID_Area017,DistArea_ID_Area018,DistArea_ID_Area019,DistArea_ID_Area027,DistArea_ID_Area035,DistArea_ID_Area045,DistArea_ID_Area046,DistArea_ID_Area049,Area_dist_level_Medium,...,Medicine_Type_OralContraceptives,Medicine_Type_Statins,Medicine_Type_Stimulants,Medicine_Type_Tranquilizers,Medicine_ID,Counterfeit_Weight,Active_Since,Medicine_MRP,Availability_rating,Counterfeit_Sales
0,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,RRA15,13.1,1995,160.2366,0.070422,1775.5026
1,0,0,0,0,1,0,0,0,0,1,...,0,0,0,0,YVV26,,1983,110.4384,0.013,3069.152
2,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,LJC15,9.025,1995,259.4092,0.060783,2603.092
3,0,0,0,0,0,0,0,1,0,0,...,1,0,0,0,GWC40,11.8,1995,99.983,0.065555,1101.713
4,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,QMN13,,1983,56.4402,0.248859,158.9402


In [14]:
# unique values and their counts in the 'Active_Since' column
bd_train['Active_Since'].value_counts().sort_index() 

Active_Since
1983    1166
1985     749
1995     749
1996     442
1997     739
2000     736
2002     748
2005     760
2007     729
Name: count, dtype: int64

In [15]:
# converting the 'Active_Since' year into 'Active_years' (age as of 2023)
bd_train['Active_years']=(2023-bd_train['Active_Since'])
bd_test['Active_years']=(2023-bd_test['Active_Since'])

bd_train.drop(['Active_Since'],axis=1,inplace=True)
bd_test.drop(['Active_Since'],axis=1,inplace=True)

In [16]:
bd_train.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6818 entries, 0 to 6817
Data columns (total 39 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   DistArea_ID_Area013               6818 non-null   int32  
 1   DistArea_ID_Area017               6818 non-null   int32  
 2   DistArea_ID_Area018               6818 non-null   int32  
 3   DistArea_ID_Area019               6818 non-null   int32  
 4   DistArea_ID_Area027               6818 non-null   int32  
 5   DistArea_ID_Area035               6818 non-null   int32  
 6   DistArea_ID_Area045               6818 non-null   int32  
 7   DistArea_ID_Area046               6818 non-null   int32  
 8   DistArea_ID_Area049               6818 non-null   int32  
 9   Area_dist_level_Medium            6818 non-null   int32  
 10  Area_dist_level_Small             6818 non-null   int32  
 11  Area_dist_level_Unknown           6818 non-null   int32  
 12  Area_C

All the columns are numeric except "Medicine_ID".

**Imputing the missing data: using KNN Imputer**

In [17]:
# handling missing value
bd_train.isnull().sum() 

DistArea_ID_Area013                    0
DistArea_ID_Area017                    0
DistArea_ID_Area018                    0
DistArea_ID_Area019                    0
DistArea_ID_Area027                    0
DistArea_ID_Area035                    0
DistArea_ID_Area045                    0
DistArea_ID_Area046                    0
DistArea_ID_Area049                    0
Area_dist_level_Medium                 0
Area_dist_level_Small                  0
Area_dist_level_Unknown                0
Area_City_Type_Tier 2                  0
Area_City_Type_Tier 3                  0
Area_Type_DownTown                     0
Area_Type_Industrial                   0
Area_Type_MidTownResidential           0
SidEffect_Level_mild                   0
Medicine_Type_Antacids                 0
Medicine_Type_Antibiotics              0
Medicine_Type_Antifungal               0
Medicine_Type_Antimalarial             0
Medicine_Type_Antipyretics             0
Medicine_Type_Antiseptics              0
Medicine_Type_An

Missing values are present only in the "Counterfeit_Weight" column.

In [18]:
# splitting the data into input (x) and output (y) and dropping 'Medicine_ID' from the training and test data
x_train=bd_train.drop(['Counterfeit_Sales','Medicine_ID'],axis=1)
y_train=bd_train['Counterfeit_Sales']

x_test=bd_test.drop(labels=['Medicine_ID'],axis=1)

x_train.shape,x_test.shape

((6818, 37), (1705, 37))

Imputing the missing values:

In [20]:
# importing the k-Nearest Neighbors imputer  
from sklearn.impute import KNNImputer
impute=KNNImputer(n_neighbors=5)  

In [23]:
# imputing the missing values and converting the result into a DataFrame
# the KNN imputer is fitted on the train data only and then applied to the test data
x_train_imputed=pd.DataFrame(impute.fit_transform(x_train),columns=x_train.columns)
x_test_imputed=pd.DataFrame(impute.transform(x_test),columns=x_test.columns) 
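To make the fit-on-train, transform-on-test pattern concrete, here is a small sketch on toy arrays (not the project data). `KNNImputer` fills each missing entry with the mean of that feature over the nearest training rows, measured with a NaN-aware Euclidean distance.

```python
import numpy as np
from sklearn.impute import KNNImputer

nan = np.nan
# Toy training matrix with two missing entries
X_train = np.array([[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]], dtype=float)
# A new row whose first feature is missing
X_new = np.array([[nan, 4, 3]], dtype=float)

imputer = KNNImputer(n_neighbors=2)
X_train_filled = imputer.fit_transform(X_train)  # neighbours learned from train
X_new_filled = imputer.transform(X_new)          # donors still come from train only

print(X_train_filled)
print(X_new_filled)
```

Because the distance is computed on the raw feature values, columns on a large scale (such as "Medicine_MRP" here) can dominate the neighbour search. Scaling before imputing is one option worth trying, though the notebook does not do so.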

In [24]:
# x_train after imputing
x_train_imputed.head() 

Unnamed: 0,DistArea_ID_Area013,DistArea_ID_Area017,DistArea_ID_Area018,DistArea_ID_Area019,DistArea_ID_Area027,DistArea_ID_Area035,DistArea_ID_Area045,DistArea_ID_Area046,DistArea_ID_Area049,Area_dist_level_Medium,...,Medicine_Type_Mstablizers,Medicine_Type_MuscleRelaxants,Medicine_Type_OralContraceptives,Medicine_Type_Statins,Medicine_Type_Stimulants,Medicine_Type_Tranquilizers,Counterfeit_Weight,Medicine_MRP,Availability_rating,Active_years
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,13.1,160.2366,0.070422,28.0
1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,11.946,110.4384,0.013,40.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,9.025,259.4092,0.060783,28.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,11.8,99.983,0.065555,28.0
4,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,14.935,56.4402,0.248859,40.0


In [25]:
# x_test after imputing
x_test_imputed.head() 

Unnamed: 0,DistArea_ID_Area013,DistArea_ID_Area017,DistArea_ID_Area018,DistArea_ID_Area019,DistArea_ID_Area027,DistArea_ID_Area035,DistArea_ID_Area045,DistArea_ID_Area046,DistArea_ID_Area049,Area_dist_level_Medium,...,Medicine_Type_Mstablizers,Medicine_Type_MuscleRelaxants,Medicine_Type_OralContraceptives,Medicine_Type_Statins,Medicine_Type_Stimulants,Medicine_Type_Tranquilizers,Counterfeit_Weight,Medicine_MRP,Availability_rating,Active_years
0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,17.96,85.5328,0.112747,40.0
1,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,13.45,257.146,0.144446,23.0
2,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,7.1,98.1172,0.144221,23.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,18.3,135.373,0.100388,27.0
4,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,13.757,112.8016,0.022585,40.0


In [29]:
# checking for missing values after imputing
x_train_imputed.isnull().sum()

DistArea_ID_Area013                 0
DistArea_ID_Area017                 0
DistArea_ID_Area018                 0
DistArea_ID_Area019                 0
DistArea_ID_Area027                 0
DistArea_ID_Area035                 0
DistArea_ID_Area045                 0
DistArea_ID_Area046                 0
DistArea_ID_Area049                 0
Area_dist_level_Medium              0
Area_dist_level_Small               0
Area_dist_level_Unknown             0
Area_City_Type_Tier 2               0
Area_City_Type_Tier 3               0
Area_Type_DownTown                  0
Area_Type_Industrial                0
Area_Type_MidTownResidential        0
SidEffect_Level_mild                0
Medicine_Type_Antacids              0
Medicine_Type_Antibiotics           0
Medicine_Type_Antifungal            0
Medicine_Type_Antimalarial          0
Medicine_Type_Antipyretics          0
Medicine_Type_Antiseptics           0
Medicine_Type_Antiviral             0
Medicine_Type_Cardiac               0
Medicine_Typ

In [30]:
# checking for missing values after imputing
x_test_imputed.isnull().sum() 

DistArea_ID_Area013                 0
DistArea_ID_Area017                 0
DistArea_ID_Area018                 0
DistArea_ID_Area019                 0
DistArea_ID_Area027                 0
DistArea_ID_Area035                 0
DistArea_ID_Area045                 0
DistArea_ID_Area046                 0
DistArea_ID_Area049                 0
Area_dist_level_Medium              0
Area_dist_level_Small               0
Area_dist_level_Unknown             0
Area_City_Type_Tier 2               0
Area_City_Type_Tier 3               0
Area_Type_DownTown                  0
Area_Type_Industrial                0
Area_Type_MidTownResidential        0
SidEffect_Level_mild                0
Medicine_Type_Antacids              0
Medicine_Type_Antibiotics           0
Medicine_Type_Antifungal            0
Medicine_Type_Antimalarial          0
Medicine_Type_Antipyretics          0
Medicine_Type_Antiseptics           0
Medicine_Type_Antiviral             0
Medicine_Type_Cardiac               0
Medicine_Typ

In [31]:
x_train_imputed.describe() 

Unnamed: 0,DistArea_ID_Area013,DistArea_ID_Area017,DistArea_ID_Area018,DistArea_ID_Area019,DistArea_ID_Area027,DistArea_ID_Area035,DistArea_ID_Area045,DistArea_ID_Area046,DistArea_ID_Area049,Area_dist_level_Medium,...,Medicine_Type_Mstablizers,Medicine_Type_MuscleRelaxants,Medicine_Type_OralContraceptives,Medicine_Type_Statins,Medicine_Type_Stimulants,Medicine_Type_Tranquilizers,Counterfeit_Weight,Medicine_MRP,Availability_rating,Active_years
count,6818.0,6818.0,6818.0,6818.0,6818.0,6818.0,6818.0,6818.0,6818.0,6818.0,...,6818.0,6818.0,6818.0,6818.0,6818.0,6818.0,6818.0,6818.0,6818.0,6818.0
mean,0.109856,0.11147,0.106923,0.063215,0.107803,0.10971,0.10795,0.109856,0.10839,0.323115,...,0.075389,0.020387,0.101789,0.024494,0.013934,0.061602,14.131664,151.401518,0.079174,27.163684
std,0.312733,0.314736,0.309038,0.243367,0.310154,0.31255,0.310339,0.312733,0.310895,0.467701,...,0.264037,0.141331,0.302393,0.154588,0.117224,0.240448,4.335015,62.203961,0.051481,8.368979
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,5.855,41.79,0.013,16.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,10.495,104.5094,0.040058,21.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,14.017,153.1957,0.066955,26.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,17.55,196.14835,0.107697,38.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,22.65,277.1884,0.341391,40.0


#### Normalizing the data:
"Medicine_MRP", "Active_years" and "Counterfeit_Weight" are on a much larger scale than the other columns (0/1 dummies and the small "Availability_rating" ratio), so we standardize the data before fitting scale-sensitive models.

In [32]:
# importing the StandardScaler for scaling the data
from sklearn.preprocessing import StandardScaler
std=StandardScaler() 

In [33]:
# scaling the data based on train data only
x_train_std=pd.DataFrame(std.fit_transform(x_train_imputed))
x_test_std=pd.DataFrame(std.transform(x_test_imputed)) 
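As a quick check of what `StandardScaler` does, the sketch below compares it with the manual z-score $z = (x - \mu) / \sigma$, where $\mu$ and $\sigma$ come from the training column only. The numbers are toy values loosely in the range of "Medicine_MRP", not the real data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy values (illustrative only)
train_col = np.array([[41.79], [104.51], [153.20], [196.15], [277.19]])
test_col = np.array([[160.24]])

scaler = StandardScaler().fit(train_col)

# Manual z-score using the training mean and population std (ddof=0)
mu = train_col.mean(axis=0)
sigma = train_col.std(axis=0, ddof=0)
manual = (test_col - mu) / sigma

scaled_test = scaler.transform(test_col)
print(scaled_test, manual)
```

Note that `StandardScaler` uses the population standard deviation (`ddof=0`), while `DataFrame.describe()` reports the sample standard deviation (`ddof=1`); with 6,818 rows the difference is negligible. Wrapping the result in `pd.DataFrame(...)` without `columns=` also discards the column names, which is harmless for fitting but makes coefficients harder to read.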

In [34]:
def report(results, n_top=3):
    for i in range(1, n_top + 1):
        # np.flatnonzero returns the indices of `True` entries in a boolean array
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        
        for candidate in candidates:
            # print the rank of the model
            # values passed to format() are substituted into the curly brackets;
            # 0, 1, etc. are placeholders for the positions of those values
            # .6f means 6 digits after the decimal point
            print("Model with rank: {0}".format(i))
            # print the mean cross-validation score and its standard deviation
            print("Mean validation score: {0:.6f} (std: {1:.6f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            # print the parameter combination that produced this score
            print("Parameters: {0}".format(results['params'][candidate]))
            # blank line between candidates
            print("") 

### Model selection and Evaluation:
### Lasso model:

In [35]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV 
from sklearn.model_selection import RandomizedSearchCV 

In [36]:
model_las=Lasso(fit_intercept=True)

In [37]:
# 'alpha': regularization parameter
params={'alpha':np.linspace(0.01,100,50)} 

In [38]:
# searching for the best hyper-parameters for the Lasso model
grid_search_las=GridSearchCV(model_las,cv=10,param_grid=params,n_jobs=-1,verbose=10,
                         scoring='neg_mean_absolute_error')

# training the model on train data
grid_search_las.fit(x_train_std,y_train)

# performance of different models
report(grid_search_las.cv_results_) 

Fitting 10 folds for each of 50 candidates, totalling 500 fits
Model with rank: 1
Mean validation score: -825.227099 (std: 24.012748)
Parameters: {'alpha': 26.53795918367347}

Model with rank: 2
Mean validation score: -825.227417 (std: 24.034397)
Parameters: {'alpha': 24.49734693877551}

Model with rank: 3
Mean validation score: -825.235181 (std: 24.048513)
Parameters: {'alpha': 22.456734693877554}



**Lasso model:**
Mean Absolute Error: 825.22 
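One caveat with the search above: the imputer and scaler were fitted on the full training set before cross-validation, so each validation fold has already influenced the preprocessing. The effect is usually small, but it can be avoided by putting the steps in a pipeline so they are refitted inside every fold. A minimal sketch on synthetic data (the data, alpha grid and fold count are assumptions for illustration, not the notebook's settings):

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=200)
X[rng.random(X.shape) < 0.1] = np.nan  # knock out roughly 10% of the entries

# Imputation and scaling are refitted on the training part of each fold
pipe = make_pipeline(KNNImputer(n_neighbors=5), StandardScaler(), Lasso())

search = GridSearchCV(
    pipe,
    param_grid={"lasso__alpha": np.logspace(-3, 1, 10)},
    cv=5,
    scoring="neg_mean_absolute_error",
)
search.fit(X, y)

print(search.best_params_, -search.best_score_)
```

`make_pipeline` names each step after its lowercased class, which is why the grid key is `lasso__alpha`.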

### Random Forest:

In [39]:
# Random Forest
from sklearn.ensemble import RandomForestRegressor
model_rf=RandomForestRegressor() 

In [40]:
model_rf.get_params() 

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': None,
 'max_features': 1.0,
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [41]:
# parameter dictionaries
params = {"n_estimators":[100,200,300,500],
              "max_features": [8,12,15,20,25,32,37],
              "bootstrap": [True, False],
              'max_depth':[None,5,10,15,20,30],
              'min_samples_leaf':[1,2,5,10,15,20,30], 
              'min_samples_split':[2,5,10,15,20,30] 
                  }

In [44]:
# searching for the best hyper-parameters for the RandomForest model
random_search=RandomizedSearchCV(model_rf,cv=10,n_iter=200,
                       param_distributions=params,
                       n_jobs=-1,verbose=10,
                       scoring='neg_mean_absolute_error') 
# training the model (scaling the data is not necessary for RandomForest)
random_search.fit(X=x_train_imputed,y=y_train)

# performance of different models
report(random_search.cv_results_) 

Fitting 10 folds for each of 200 candidates, totalling 2000 fits
Model with rank: 1
Mean validation score: -748.305748 (std: 25.763319)
Parameters: {'n_estimators': 500, 'min_samples_split': 5, 'min_samples_leaf': 30, 'max_features': 25, 'max_depth': 15, 'bootstrap': True}

Model with rank: 2
Mean validation score: -748.561080 (std: 26.381431)
Parameters: {'n_estimators': 300, 'min_samples_split': 10, 'min_samples_leaf': 30, 'max_features': 25, 'max_depth': 10, 'bootstrap': True}

Model with rank: 3
Mean validation score: -748.596235 (std: 25.509231)
Parameters: {'n_estimators': 300, 'min_samples_split': 2, 'min_samples_leaf': 30, 'max_features': 20, 'max_depth': 15, 'bootstrap': True}



**RandomForest model:**
Mean Absolute Error: 748.30
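It helps to know how much of the search space the 200 random draws actually covered. The sketch below counts the combinations in the parameter dictionary used for the random search (values copied from the cell above).

```python
from math import prod

# Candidate values copied from the `params` dictionary above
params = {
    "n_estimators": [100, 200, 300, 500],
    "max_features": [8, 12, 15, 20, 25, 32, 37],
    "bootstrap": [True, False],
    "max_depth": [None, 5, 10, 15, 20, 30],
    "min_samples_leaf": [1, 2, 5, 10, 15, 20, 30],
    "min_samples_split": [2, 5, 10, 15, 20, 30],
}

n_combinations = prod(len(values) for values in params.values())
n_iter = 200
coverage = n_iter / n_combinations

print(n_combinations, f"{coverage:.2%}")
```

The grid holds 14,112 combinations, so 200 iterations sample about 1.4% of it. Since `random_state` was left unset for both the model and the search, a rerun may pick different candidates and report a slightly different best score.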

### XGB Regression ( Sequential Tuning):

In [45]:
# importing the XGBoost regressor (scikit-learn compatible wrapper)
from xgboost.sklearn import XGBRegressor 

#### Sequential Tuning of XGBoost Regressor:
- First, fix the parameters with the most volatile effect on performance (i.e. the number of trees 'n_estimators' and the learning rate).

- Second, control the individual tree (weak learner):

  - Control the tree complexity:
    - 'gamma' or 'min_split_loss': minimum loss reduction required to make a split (higher is more conservative)
    - 'min_child_weight': minimum sum of instance weights (Hessian) needed in a child node; for squared-error regression this amounts to a minimum number of instances (higher is more conservative)
    - 'max_depth': maximum depth of an individual tree (lower is more conservative)

  - Reduce sensitivity to noise in the data:
    - 'subsample': subsampling ratio of training instances (0 to 1)
    - 'colsample_bytree': subsampling ratio of columns for each tree

- Third, tune the regularization parameters:

  - 'lambda' ('reg_lambda' in the scikit-learn wrapper): L2 (Ridge) penalty
  - 'alpha' ('reg_alpha' in the scikit-learn wrapper): L1 (Lasso) penalty
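The staged procedure can be sketched as a loop in which each stage tunes one group of parameters and carries the winners forward. To keep the example dependency-free, it swaps in scikit-learn's `GradientBoostingRegressor` for XGBoost and uses synthetic data; the stage grids are assumptions chosen for illustration, not the notebook's grids.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (stand-in for the real features and target)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = 4 * X[:, 0] - 2 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=300)

# Each stage tunes one group of parameters
stages = [
    {"n_estimators": [50, 100, 200]},                          # number of trees
    {"max_depth": [2, 3, 4], "min_samples_leaf": [1, 5, 10]},  # tree complexity
    {"subsample": [0.6, 0.8, 1.0]},                            # row subsampling
]

# Parameters fixed up front; later stages add their winners to this dict
best_params = {"learning_rate": 0.05, "random_state": 0}
for grid in stages:
    model = GradientBoostingRegressor(**best_params)
    search = GridSearchCV(model, param_grid=grid, cv=3,
                          scoring="neg_mean_absolute_error")
    search.fit(X, y)
    best_params.update(search.best_params_)  # carry forward to the next stage

print(best_params)
```

Tuning in stages costs far fewer fits than one joint grid, at the price of possibly missing interactions between parameters tuned in different stages.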

In [49]:
# a lower learning rate generally needs more estimators, so the two are traded off against each other
# fixing a moderate learning rate (0.05) and tuning the number of estimators
xgb_params = { "n_estimators":[25,50,100,150,200,300,500] }

# storing the model in an object and searching for the best hyper-parameters
xgb1=XGBRegressor(learning_rate=0.05,subsample=0.8,colsample_bylevel=0.8,colsample_bytree=0.8)
grid_search_xgb1=GridSearchCV(xgb1,cv=10,param_grid=xgb_params,scoring='neg_mean_absolute_error',verbose=False,n_jobs=-1)

# training the models on train data
grid_search_xgb1.fit(x_train_imputed,y_train)

# performance of different models
report(grid_search_xgb1.cv_results_,3) 

Model with rank: 1
Mean validation score: -756.176966 (std: 26.998273)
Parameters: {'n_estimators': 100}

Model with rank: 2
Mean validation score: -761.745456 (std: 27.635956)
Parameters: {'n_estimators': 150}

Model with rank: 3
Mean validation score: -767.156727 (std: 28.005764)
Parameters: {'n_estimators': 200}



In [50]:
# controlling the individual tree to avoid overfitting
xgb_params = {"gamma":[0,2,5,8,10],
              "max_depth": [2,3,4,5,6,7,8],
              "min_child_weight":range(1,20)}

xgb2=XGBRegressor(n_estimators=100,learning_rate=0.05,
                   subsample=0.8,colsample_bylevel=0.8,colsample_bytree=0.8)

grid_search_xgb2=GridSearchCV(xgb2,param_grid=xgb_params,cv=5,
                            # sklearn always tries to maximize the score, but we want to minimize the error (MAE).
                            # the error is multiplied by -1 so that minimizing it becomes a maximization problem.
                            scoring='neg_mean_absolute_error',                                 
                            verbose=False,n_jobs=-1)

grid_search_xgb2.fit(x_train_imputed,y_train)

report(grid_search_xgb2.cv_results_,3) 

Model with rank: 1
Mean validation score: -750.429593 (std: 15.543927)
Parameters: {'gamma': 0, 'max_depth': 4, 'min_child_weight': 9}

Model with rank: 1
Mean validation score: -750.429593 (std: 15.543927)
Parameters: {'gamma': 2, 'max_depth': 4, 'min_child_weight': 9}

Model with rank: 1
Mean validation score: -750.429593 (std: 15.543927)
Parameters: {'gamma': 5, 'max_depth': 4, 'min_child_weight': 9}

Model with rank: 1
Mean validation score: -750.429593 (std: 15.543927)
Parameters: {'gamma': 8, 'max_depth': 4, 'min_child_weight': 9}

Model with rank: 1
Mean validation score: -750.429593 (std: 15.543927)
Parameters: {'gamma': 10, 'max_depth': 4, 'min_child_weight': 9}



In [52]:
xgb_params = {'subsample':[i/10 for i in range(5,11)],
            'colsample_bytree':[i/10 for i in range(5,11)]} 

xgb3=XGBRegressor( min_child_weight=9, max_depth=4, gamma=8, # all gamma values tied above; gamma=8 is kept as a more conservative setting
                   n_estimators=100,learning_rate=0.05,
                   subsample=0.8,colsample_bylevel=0.8,colsample_bytree=0.8)

grid_search_xgb3=GridSearchCV(xgb3,param_grid=xgb_params,cv=10,
                             scoring='neg_mean_absolute_error', verbose=True,n_jobs=-1) 

grid_search_xgb3.fit(x_train_imputed,y_train)

report(grid_search_xgb3.cv_results_,3) 

Fitting 10 folds for each of 36 candidates, totalling 360 fits
Model with rank: 1
Mean validation score: -749.098293 (std: 25.772950)
Parameters: {'colsample_bytree': 0.9, 'subsample': 0.5}

Model with rank: 2
Mean validation score: -749.859818 (std: 27.744730)
Parameters: {'colsample_bytree': 0.9, 'subsample': 1.0}

Model with rank: 3
Mean validation score: -750.132992 (std: 26.725349)
Parameters: {'colsample_bytree': 0.8, 'subsample': 0.9}



In [56]:
xgb_params={'reg_lambda':[i/10 for i in range(0,50,2)],
            'reg_alpha':[i/10 for i in range(0,50,2)]} 

xgb4=XGBRegressor( n_estimators=100,learning_rate=0.05,
                   min_child_weight=9, max_depth=4, gamma=8,
                   subsample=0.5,colsample_bylevel=0.8,colsample_bytree=0.9)

grid_search_xgb4=GridSearchCV(xgb4,param_grid=xgb_params,cv=10,
                             scoring='neg_mean_absolute_error', verbose=True,n_jobs=-1) 

grid_search_xgb4.fit(x_train_imputed,y_train)

report(grid_search_xgb4.cv_results_,3) 

Fitting 10 folds for each of 625 candidates, totalling 6250 fits
Model with rank: 1
Mean validation score: -749.098293 (std: 25.772950)
Parameters: {'reg_alpha': 0.0, 'reg_lambda': 1.0}

Model with rank: 2
Mean validation score: -749.098378 (std: 25.772912)
Parameters: {'reg_alpha': 0.2, 'reg_lambda': 1.0}

Model with rank: 3
Mean validation score: -749.113926 (std: 25.808539)
Parameters: {'reg_alpha': 1.0, 'reg_lambda': 1.0}



**XGBoost Regressor model:**
Mean Absolute Error: 749.09 

In [57]:
# final XGBoost Regressor model
xgb5=XGBRegressor( n_estimators=100,learning_rate=0.05,
                   min_child_weight=9, max_depth=4, gamma=8,
                   subsample=0.5,colsample_bylevel=0.8,colsample_bytree=0.9,
                 reg_lambda=1,reg_alpha=0) 

xgb5.fit(x_train_imputed,y_train) 

In [58]:
xgb5.predict(x_test_imputed)

array([2145.8267, 3788.1338, 1484.7267, ..., 2854.1914, 3634.9602,
       3761.942 ], dtype=float32)

#### Predictions:

XGBoost often performs better than RandomForest on large data sets, but our data set is not very big, so the RandomForest and XGBoost models give similar performance, with Mean Absolute Errors of 748.30 and 749.09 respectively.
We predict "Counterfeit_Sales" using the RandomForest model.
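To see how close the models really are, the sketch below tabulates the best cross-validated scores reported above (values copied from the search outputs) and compares the gap between the top two models with the fold-to-fold standard deviation.

```python
import pandas as pd

# Best cross-validated scores copied from the search reports above
cv_results = pd.DataFrame(
    {
        "model": ["Lasso", "RandomForest", "XGBoost"],
        "mae": [825.227099, 748.305748, 749.098293],
        "std": [24.012748, 25.763319, 25.772950],
    }
).sort_values("mae", ignore_index=True)

best, runner_up = cv_results.iloc[0], cv_results.iloc[1]
gap = runner_up["mae"] - best["mae"]

print(cv_results)
print(f"gap between top two: {gap:.3f} (fold std of best: {best['std']:.3f})")
```

The gap between RandomForest and XGBoost is under 1 MAE unit, far smaller than the roughly 26-unit spread across folds, so the ranking of the two could plausibly flip on a rerun. Both clearly beat Lasso.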

In [90]:
# saving the best model
model_rf=random_search.best_estimator_ 

# predicting the counterfeit_sales
predictions_rf=pd.DataFrame({'Counterfeit_Sales':model_rf.predict(x_test_imputed)}) 
# storing the results in .csv format
predictions_rf.to_csv('E:/Data science/Edvancer/ML with Python/Projects/Project 3_Public Safety/NaiduBabu_Yadla_rf_P3.csv',index=False)