# CH4 Emission Analysis using Machine Learning Model

Each country has CH4 emissions from different items, and we want to classify the emission values into zones based on the available features.

#### Since CH4 is a potent greenhouse gas, the zones are split into 4 categories: 0 for Green, 1 for Yellow, 2 for Orange and 3 for Red. Countries in Zone 3 (Red) need to take immediate action to reduce their CH4 emissions.

In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=Warning)

In [2]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

### Get and Clean Data

Element Codes: 7225

Item_Codes: 5058, 5059, 5060, 5066, 6795, 6992, 6993, 6994, 6516

Year: 2011 to 2019


Our list contains many small countries with very low emissions (Emission <= 1) or small populations (Population <= 5000), which have little impact on world emissions. So, we do not consider those records.

In [3]:
# Load the CSV file from the S3 bucket
noworld_population_df = pd.read_csv("https://dataanalyticsproject.s3.us-east-2.amazonaws.com/Merged_L5000.csv",index_col=[0]) 
#noworld_population_df = pd.read_csv("Emission_Population_L5000_Data.csv") 
noworld_population_df

Unnamed: 0,Area_Code,Area,Item_Code,Item,Element_Code,Element,Year,Population,Emission
0,2,Afghanistan,5058,Enteric Fermentation,7225,Emissions (CH4),1990,12412.308,178.4682
1,2,Afghanistan,5058,Enteric Fermentation,724413,Emissions (CO2eq) from CH4 (AR5),1990,12412.308,4997.1108
2,2,Afghanistan,5058,Enteric Fermentation,723113,Emissions (CO2eq) (AR5),1990,12412.308,4997.1108
3,2,Afghanistan,5059,Manure Management,7225,Emissions (CH4),1990,12412.308,8.5165
4,2,Afghanistan,5059,Manure Management,7230,Emissions (N2O),1990,12412.308,0.3046
...,...,...,...,...,...,...,...,...,...
844768,181,Zimbabwe,6516,Land Use change,7230,Emissions (N2O),2019,14645.468,0.0000
844769,181,Zimbabwe,6516,Land Use change,7273,Emissions (CO2),2019,14645.468,10662.4408
844770,181,Zimbabwe,6516,Land Use change,724413,Emissions (CO2eq) from CH4 (AR5),2019,14645.468,0.0000
844771,181,Zimbabwe,6516,Land Use change,724313,Emissions (CO2eq) from N2O (AR5),2019,14645.468,0.0000


In [4]:
#Finding the Statistical values of each feature
noworld_population_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Area_Code,728361.0,129.037646,75.436579,1.0,64.0,128.0,192.0,351.0
Item_Code,728361.0,9750.328847,16082.691077,1707.0,5061.0,6750.0,6994.0,69921.0
Element_Code,728361.0,354233.179775,358123.888115,7225.0,7230.0,7273.0,724313.0,724413.0
Year,728361.0,2004.643095,8.607385,1990.0,1997.0,2005.0,2012.0,2019.0
Population,728361.0,38135.675229,156133.177982,0.768,766.615,5716.161,20526.303,1465634.161
Emission,728361.0,4009.738416,43708.062051,-797183.079,0.0076,3.5152,216.6866,2171273.959


In [5]:
#Details of non-numeric features
noworld_population_df.describe(include=['object']).T

Unnamed: 0,count,unique,top,freq
Area,728361,240,Portugal,3570
Item,728361,25,Emissions on agricultural land,53082
Element,728361,8,Emissions (CO2eq) (AR5),146660


In [6]:
#Checking the null values
noworld_population_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 728361 entries, 0 to 844772
Data columns (total 9 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Area_Code     728361 non-null  int64  
 1   Area          728361 non-null  object 
 2   Item_Code     728361 non-null  int64  
 3   Item          728361 non-null  object 
 4   Element_Code  728361 non-null  int64  
 5   Element       728361 non-null  object 
 6   Year          728361 non-null  int64  
 7   Population    728361 non-null  float64
 8   Emission      728361 non-null  float64
dtypes: float64(2), int64(4), object(3)
memory usage: 55.6+ MB


In [7]:
#List of columns
noworld_population_df.columns

Index(['Area_Code', 'Area', 'Item_Code', 'Item', 'Element_Code', 'Element',
       'Year', 'Population', 'Emission'],
      dtype='object')

In [8]:
# Keep only CH4 rows (Element_Code 7225) for the selected items, with Year > 2010, Emission > 1 and Population > 5000
emissions_CH4_df = noworld_population_df[noworld_population_df['Item_Code'].isin([5058, 5059, 5060, 5066, 6795, 6992, 6993, 6994, 6516]) & 
                                        (noworld_population_df['Element_Code'] == 7225) &
                                        (noworld_population_df['Year'] > 2010 ) & (noworld_population_df['Emission'] > 1) &
                                            (noworld_population_df['Population'] > 5000) ]


emissions_CH4_df

Unnamed: 0,Area_Code,Area,Item_Code,Item,Element_Code,Element,Year,Population,Emission
591077,2,Afghanistan,5058,Enteric Fermentation,7225,Emissions (CH4),2011,30117.413,402.5130
591080,2,Afghanistan,5059,Manure Management,7225,Emissions (CH4),2011,30117.413,26.1599
591085,2,Afghanistan,5060,Rice Cultivation,7225,Emissions (CH4),2011,30117.413,29.4000
591108,2,Afghanistan,5066,Burning - Crop residues,7225,Emissions (CH4),2011,30117.413,3.2219
591309,4,Algeria,5058,Enteric Fermentation,7225,Emissions (CH4),2011,36661.445,228.8808
...,...,...,...,...,...,...,...,...,...
844663,181,Zimbabwe,5059,Manure Management,7225,Emissions (CH4),2019,14645.468,7.4455
844668,181,Zimbabwe,5060,Rice Cultivation,7225,Emissions (CH4),2019,14645.468,1.8242
844691,181,Zimbabwe,5066,Burning - Crop residues,7225,Emissions (CH4),2019,14645.468,2.5388
844700,181,Zimbabwe,6795,Savanna fires,7225,Emissions (CH4),2019,14645.468,31.2309


In [9]:
#Finding the Statistical values of each feature
emissions_CH4_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Area_Code,5034.0,134.58085,79.526969,2.0,60.0,130.0,202.0,351.0
Item_Code,5034.0,5572.480334,806.514979,5058.0,5059.0,5060.0,6516.0,6994.0
Element_Code,5034.0,7225.0,0.0,7225.0,7225.0,7225.0,7225.0,7225.0
Year,5034.0,2015.009138,2.578918,2011.0,2013.0,2015.0,2017.0,2019.0
Population,5034.0,100819.290266,266051.85895,5013.709,11482.178,26969.307,63811.199,1465634.0
Emission,5034.0,279.549859,1062.21843,1.0008,5.345475,24.3818,137.7057,14053.66


In [10]:
#Dropping the unwanted columns 
emissions_CH4_df=emissions_CH4_df.drop(['Area','Item','Element', 'Element_Code'],axis=1)
emissions_CH4_df.head()

Unnamed: 0,Area_Code,Item_Code,Year,Population,Emission
591077,2,5058,2011,30117.413,402.513
591080,2,5059,2011,30117.413,26.1599
591085,2,5060,2011,30117.413,29.4
591108,2,5066,2011,30117.413,3.2219
591309,4,5058,2011,36661.445,228.8808


# Categorizing data

Item_Code = 0 to 8

Year = 1 to 9 (2011 through 2019)

Population = 0 to 4

Emission = 0 to 7

Zone 0 to 3 

In [11]:
# Categorizing the Item_Code data
emissions_CH4_df.loc[emissions_CH4_df["Item_Code"] == 5058, "Item_Code"] = 0
emissions_CH4_df.loc[emissions_CH4_df["Item_Code"] == 5059, "Item_Code"] = 1
emissions_CH4_df.loc[emissions_CH4_df["Item_Code"] == 5060, "Item_Code"] = 2
emissions_CH4_df.loc[emissions_CH4_df["Item_Code"] == 5066, "Item_Code"] = 3

emissions_CH4_df.loc[emissions_CH4_df["Item_Code"] == 6516, "Item_Code"] = 4
emissions_CH4_df.loc[emissions_CH4_df["Item_Code"] == 6795, "Item_Code"] = 5
emissions_CH4_df.loc[emissions_CH4_df["Item_Code"] == 6992, "Item_Code"] = 6
emissions_CH4_df.loc[emissions_CH4_df["Item_Code"] == 6993, "Item_Code"] = 7
emissions_CH4_df.loc[emissions_CH4_df["Item_Code"] == 6994, "Item_Code"] = 8
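The nine assignments above can also be written as a single dictionary lookup. This is a minimal sketch on a small hypothetical frame (`demo_df` is an illustration, not the real dataset); the code-to-category mapping is the one used in the cell above.

```python
import pandas as pd

# Same Item_Code -> category mapping as the cell above, kept in one place.
item_code_map = {5058: 0, 5059: 1, 5060: 2, 5066: 3, 6516: 4,
                 6795: 5, 6992: 6, 6993: 7, 6994: 8}

# Hypothetical toy frame standing in for emissions_CH4_df.
demo_df = pd.DataFrame({"Item_Code": [5058, 5066, 6516, 6994]})

# Series.map replaces each code with its category in a single pass.
demo_df["Item_Code"] = demo_df["Item_Code"].map(item_code_map)
print(demo_df["Item_Code"].tolist())
```

A single lookup avoids any risk of a value being remapped twice, which sequential in-place assignments could cause if the original codes and the new categories ever overlapped.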

In [12]:
emissions_CH4_df.head()

Unnamed: 0,Area_Code,Item_Code,Year,Population,Emission
591077,2,0,2011,30117.413,402.513
591080,2,1,2011,30117.413,26.1599
591085,2,2,2011,30117.413,29.4
591108,2,3,2011,30117.413,3.2219
591309,4,0,2011,36661.445,228.8808


In [13]:
# Categorizing Year 2011 through 2019 as 1 to 9 (2010 would map to 0, but it was filtered out above)
emissions_CH4_df.loc[emissions_CH4_df["Year"] == 2010, "Year"] = 0
emissions_CH4_df.loc[emissions_CH4_df["Year"] == 2011, "Year"] = 1
emissions_CH4_df.loc[emissions_CH4_df["Year"] == 2012, "Year"] = 2

emissions_CH4_df.loc[emissions_CH4_df["Year"] == 2013, "Year"] = 3
emissions_CH4_df.loc[emissions_CH4_df["Year"] == 2014, "Year"] = 4
emissions_CH4_df.loc[emissions_CH4_df["Year"] == 2015, "Year"] = 5

emissions_CH4_df.loc[emissions_CH4_df["Year"] == 2016, "Year"] = 6
emissions_CH4_df.loc[emissions_CH4_df["Year"] == 2017, "Year"] = 7
emissions_CH4_df.loc[emissions_CH4_df["Year"] == 2018, "Year"] = 8
emissions_CH4_df.loc[emissions_CH4_df["Year"] == 2019, "Year"] = 9

emissions_CH4_df.head()

Unnamed: 0,Area_Code,Item_Code,Year,Population,Emission
591077,2,0,1,30117.413,402.513
591080,2,1,1,30117.413,26.1599
591085,2,2,1,30117.413,29.4
591108,2,3,1,30117.413,3.2219
591309,4,0,1,36661.445,228.8808
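Since each encoded value is simply the year minus 2010, the same encoding can be obtained arithmetically. A short sketch on hypothetical sample years:

```python
import pandas as pd

# Hypothetical sample of years in the filtered range 2011..2019.
years = pd.Series([2011, 2015, 2019], name="Year")

# Subtracting the base year gives the categories 1..9 used in the notebook.
year_category = years - 2010
print(year_category.tolist())
```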


In [14]:
# Categorizing Population data into 5 categories

emissions_CH4_df.loc[emissions_CH4_df["Population"] <=10000, "Population"] = 0
emissions_CH4_df.loc[(emissions_CH4_df["Population"] > 10000) & (emissions_CH4_df["Population"] <= 50000) , "Population"] = 1
emissions_CH4_df.loc[(emissions_CH4_df["Population"] > 50000) & (emissions_CH4_df["Population"] <= 100000), "Population"] = 2
emissions_CH4_df.loc[(emissions_CH4_df["Population"] > 100000) & (emissions_CH4_df["Population"] <= 1000000) , "Population"] = 3
emissions_CH4_df.loc[(emissions_CH4_df["Population"] > 1000000) , "Population"] = 4     
emissions_CH4_df.head()


Unnamed: 0,Area_Code,Item_Code,Year,Population,Emission
591077,2,0,1,1.0,402.513
591080,2,1,1,1.0,26.1599
591085,2,2,1,1.0,29.4
591108,2,3,1,1.0,3.2219
591309,4,0,1,1.0,228.8808
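The population bins can also be expressed with `pd.cut`, which keeps the thresholds in one list. This is a sketch under the assumption that the bin edges match the comparisons above; the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical population values, one per category.
population = pd.Series([5013.7, 30117.4, 63811.2, 250000.0, 1465634.2])

# Right-closed bins reproduce the "<= upper bound" comparisons used above.
bin_edges = [-np.inf, 10_000, 50_000, 100_000, 1_000_000, np.inf]

# labels=False returns the integer bin index (0..4) instead of an interval.
population_category = pd.cut(population, bins=bin_edges, labels=False)
print(population_category.tolist())
```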


In [15]:
#Creating Zone Variable
emissions_CH4_df.loc[(emissions_CH4_df["Emission"] > 1) & (emissions_CH4_df["Emission"] <= 10) , "Zone"] = 0
emissions_CH4_df.loc[(emissions_CH4_df["Emission"] > 10) & (emissions_CH4_df["Emission"] <= 25) , "Zone"] = 1
emissions_CH4_df.loc[(emissions_CH4_df["Emission"] > 25) & (emissions_CH4_df["Emission"] <= 75) , "Zone"] = 2
emissions_CH4_df.loc[(emissions_CH4_df["Emission"] > 75),"Zone"]= 3 

emissions_CH4_df.head() 

Unnamed: 0,Area_Code,Item_Code,Year,Population,Emission,Zone
591077,2,0,1,1.0,402.513,3.0
591080,2,1,1,1.0,26.1599,2.0
591085,2,2,1,1.0,29.4,2.0
591108,2,3,1,1.0,3.2219,0.0
591309,4,0,1,1.0,228.8808,3.0
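The zone thresholds can be checked on a few values with `np.digitize`. A minimal sketch, assuming every input is already greater than 1 (as guaranteed by the earlier filter); the sample values are hypothetical and the colour names follow the introduction.

```python
import numpy as np

# Hypothetical emission values, all > 1.
emission = np.array([3.2219, 26.1599, 12.0, 402.513])

# Upper bounds of zones 0, 1 and 2; right=True mirrors the "<=" comparisons.
zone_upper_bounds = [10, 25, 75]
zone = np.digitize(emission, zone_upper_bounds, right=True)

zone_names = ["Green", "Yellow", "Orange", "Red"]
print([(int(z), zone_names[z]) for z in zone])
```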


In [16]:
# Categorizing Emission values into 8 different categories (0 to 7)

emissions_CH4_df.loc[(emissions_CH4_df["Emission"] > 1) & (emissions_CH4_df["Emission"] <= 10) , "Emission"] = 0
emissions_CH4_df.loc[(emissions_CH4_df["Emission"] > 10) & (emissions_CH4_df["Emission"] <= 15) , "Emission"] = 1
emissions_CH4_df.loc[(emissions_CH4_df["Emission"] > 15) & (emissions_CH4_df["Emission"] <= 20) , "Emission"] = 2
emissions_CH4_df.loc[(emissions_CH4_df["Emission"] > 20) & (emissions_CH4_df["Emission"] <= 30) , "Emission"] = 3
emissions_CH4_df.loc[(emissions_CH4_df["Emission"] > 30) & (emissions_CH4_df["Emission"] <= 50) , "Emission"] = 4
emissions_CH4_df.loc[(emissions_CH4_df["Emission"] > 50) & (emissions_CH4_df["Emission"] <= 100) , "Emission"] = 5
emissions_CH4_df.loc[(emissions_CH4_df["Emission"] > 100) & (emissions_CH4_df["Emission"] <= 200) , "Emission"] = 6
emissions_CH4_df.loc[(emissions_CH4_df["Emission"] > 200),"Emission"] = 7

emissions_CH4_df.head()                      

Unnamed: 0,Area_Code,Item_Code,Year,Population,Emission,Zone
591077,2,0,1,1.0,7.0,3.0
591080,2,1,1,1.0,3.0,2.0
591085,2,2,1,1.0,3.0,2.0
591108,2,3,1,1.0,0.0,0.0
591309,4,0,1,1.0,7.0,3.0
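Both the Emission category and the Zone are derived from the same raw emission value, so the feature is closely tied to the target. The sketch below uses hypothetical raw values with the bin edges from the cells above to show which Emission categories fall entirely inside one zone and which straddle two.

```python
import numpy as np
import pandas as pd

# Hypothetical raw emission values covering every Emission category.
raw = np.array([5.0, 12.0, 18.0, 22.0, 28.0, 40.0, 60.0, 90.0, 150.0, 300.0])

# Upper bounds from the cells above; right=True mirrors the "<=" comparisons.
zone = np.digitize(raw, [10, 25, 75], right=True)
emission_cat = np.digitize(raw, [10, 15, 20, 30, 50, 100, 200], right=True)

# Rows: Emission category, columns: Zone.
overlap = pd.crosstab(pd.Series(emission_cat, name="Emission"),
                      pd.Series(zone, name="Zone"))
zones_per_category = (overlap > 0).sum(axis=1)
print(overlap)
print(zones_per_category.tolist())
```

With these edges, only categories 3 (20 to 30) and 5 (50 to 100) span two zones, which may be where a classifier has to rely on the other features.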


In [17]:
emissions_CH4_df["Population"] = emissions_CH4_df["Population"].astype(int)
emissions_CH4_df["Emission"] = emissions_CH4_df["Emission"].astype(int)
emissions_CH4_df["Zone"] = emissions_CH4_df["Zone"].astype(int)

In [18]:
emissions_CH4_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Area_Code,5034.0,134.58085,79.526969,2.0,60.0,130.0,202.0,351.0
Item_Code,5034.0,2.54708,2.234572,0.0,1.0,2.0,4.0,8.0
Year,5034.0,5.009138,2.578918,1.0,3.0,5.0,7.0,9.0
Population,5034.0,1.334724,1.020843,0.0,1.0,1.0,2.0,4.0
Emission,5034.0,3.130314,2.810687,0.0,0.0,3.0,6.0,7.0
Zone,5034.0,1.472785,1.271176,0.0,0.0,1.0,3.0,3.0


In [19]:
emissions_CH4_df.tail(20) 

Unnamed: 0,Area_Code,Item_Code,Year,Population,Emission,Zone
844176,237,2,9,2,7,3
844199,237,3,9,2,1,1
844214,237,5,9,2,0,0
844224,237,6,9,2,0,0
844281,237,4,9,2,0,0
844431,249,0,9,1,6,3
844434,249,1,9,1,0,0
844541,251,0,9,1,6,3
844544,251,1,9,1,0,0
844549,251,2,9,1,0,0


In [20]:
emissions_CH4_df.nunique()

Area_Code     120
Item_Code       9
Year            9
Population      5
Emission        8
Zone            4
dtype: int64

In [21]:
emissions_CH4_df["Emission"].value_counts()

0    1776
7     971
5     508
6     507
3     372
4     370
1     277
2     253
Name: Emission, dtype: int64

In [22]:
emissions_CH4_df["Zone"].value_counts()

0    1776
3    1663
2     830
1     765
Name: Zone, dtype: int64

In [23]:
emissions_CH4_df["Population"].value_counts()

1    2651
0     854
2     717
3     614
4     198
Name: Population, dtype: int64

## Machine Learning

Data cleaning and categorization of the input features are done.

Machine learning methods that predict future emissions depend on many factors such as soil temperature, air moisture and volumetric water content (VWC), which are not part of this dataset. So, we settle on classification algorithms, which help us assign the emission values to different zones for each element (N2O, CH4, CO2).

As we have **imbalanced Emission values** depending on country size and population, we can't simply classify the zones into binary values. **Multiclass classification** is the problem of classifying instances into one of three or more classes.

#### Popular algorithms that can be used for multi-class classification include:

Logistic regression

Decision Trees

Random Forest

Naive Bayes

k-Nearest Neighbors

Gradient Boosting


**Logistic regression** is a simple yet very effective classification algorithm. Multinomial logistic regression is an extension of logistic regression that adds native support for multi-class classification problems. So, we are starting with this algorithm. 
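Multinomial logistic regression turns one linear score per class into probabilities with the softmax function and predicts the class with the highest probability. The sketch below illustrates that step only; the scores are hypothetical and not taken from the fitted model.

```python
import numpy as np

def softmax(scores):
    """Convert per-class linear scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Hypothetical linear scores for zones 0..3 for a single sample.
zone_scores = np.array([0.5, 1.0, 3.0, 2.0])
zone_probabilities = softmax(zone_scores)

# The predicted zone is the one with the largest probability.
predicted_zone = int(np.argmax(zone_probabilities))
print(zone_probabilities.round(3), predicted_zone)
```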

In [24]:
emissions_CH4_df.reset_index(inplace=True, drop=True)

In [25]:
emissions_CH4_df

Unnamed: 0,Area_Code,Item_Code,Year,Population,Emission,Zone
0,2,0,1,1,7,3
1,2,1,1,1,3,2
2,2,2,1,1,3,2
3,2,3,1,1,0,0
4,4,0,1,1,7,3
...,...,...,...,...,...,...
5029,181,1,9,1,0,0
5030,181,2,9,1,0,0
5031,181,3,9,1,0,0
5032,181,5,9,1,4,2


In [26]:
# Segment the features from the target
X = emissions_CH4_df[["Item_Code", "Year", "Population", "Emission"]]
y = emissions_CH4_df[["Zone"]]

In [27]:
X

Unnamed: 0,Item_Code,Year,Population,Emission
0,0,1,1,7
1,1,1,1,3
2,2,1,1,3
3,3,1,1,0
4,0,1,1,7
...,...,...,...,...
5029,1,9,1,0
5030,2,9,1,0
5031,3,9,1,0
5032,5,9,1,4


In [28]:
#y = y.ravel()
y

Unnamed: 0,Zone
0,3
1,2
2,2
3,0
4,3
...,...
5029,0
5030,0
5031,0
5032,2


In [29]:
emissions_CH4_df

Unnamed: 0,Area_Code,Item_Code,Year,Population,Emission,Zone
0,2,0,1,1,7,3
1,2,1,1,1,3,2
2,2,2,1,1,3,2
3,2,3,1,1,0,0
4,4,0,1,1,7,3
...,...,...,...,...,...,...
5029,181,1,9,1,0,0
5030,181,2,9,1,0,0
5031,181,3,9,1,0,0
5032,181,5,9,1,4,2


In [30]:
np.shape(X)

(5034, 4)

In [31]:
np.shape(y)

(5034, 1)

In [32]:
test_sizes = 0.20
seed = 1
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=test_sizes, random_state=seed, stratify=y)
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

(4027, 4)
(1007, 4)
(4027, 1)
(1007, 1)


In [33]:
#model = LogisticRegression()
model = LogisticRegression(multi_class='multinomial', solver='lbfgs')

In [34]:
# Pass the target as a 1-D array, which is the shape scikit-learn expects
a = model.fit(X_train, Y_train.values.ravel())
a

LogisticRegression(multi_class='multinomial')

In [35]:
predictions = model.predict(X_test)

In [36]:
predictions

array([0, 3, 0, ..., 0, 2, 2])

In [37]:
X_train

Unnamed: 0,Item_Code,Year,Population,Emission
2006,0,4,1,7
2382,3,5,1,0
1441,6,3,2,6
4206,3,8,1,0
3550,2,7,1,2
...,...,...,...,...
1572,1,3,0,3
2675,5,5,2,4
4071,6,8,2,7
3223,5,6,1,4


In [38]:
X_test
print(a.score(X_test, Y_test))

0.94240317775571


In [39]:
Y_test

Unnamed: 0,Zone
1145,0
4884,3
1871,0
4349,2
63,2
...,...
2713,2
216,0
4236,0
3816,2


In [40]:
predictions

array([0, 3, 0, ..., 0, 2, 2])

In [41]:
print(classification_report(Y_test, predictions))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       355
           1       0.86      1.00      0.92       153
           2       0.82      0.83      0.83       166
           3       0.99      0.91      0.95       333

    accuracy                           0.94      1007
   macro avg       0.92      0.94      0.92      1007
weighted avg       0.95      0.94      0.94      1007



In [42]:
confusion_matrix(Y_test, predictions)

array([[355,   0,   0,   0],
       [  0, 153,   0,   0],
       [  0,  25, 138,   3],
       [  0,   0,  30, 303]], dtype=int64)
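The figures in the classification report can be recomputed directly from this confusion matrix. The sketch below copies the matrix printed above (rows are true zones, columns are predicted zones) and derives recall, precision and accuracy from it.

```python
import numpy as np

# Confusion matrix from the logistic regression run above
# (rows: true zone, columns: predicted zone).
cm = np.array([[355,   0,   0,   0],
               [  0, 153,   0,   0],
               [  0,  25, 138,   3],
               [  0,   0,  30, 303]])

recall = np.diag(cm) / cm.sum(axis=1)     # correct / actual count per zone
precision = np.diag(cm) / cm.sum(axis=0)  # correct / predicted count per zone
accuracy = np.trace(cm) / cm.sum()        # all correct / all samples

print(recall.round(2), precision.round(2), round(accuracy, 4))
```

All misclassifications sit between neighbouring zones (2 predicted as 1 or 3, and 3 predicted as 2).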

## Random forest classifier

A random forest classifier works with data that has discrete labels, better known as classes.

#### Advantages of Random Forest

It reduces overfitting in decision trees and helps to improve the accuracy

It can be applied to both classification and regression problems

It works well with both categorical and continuous values

It can handle missing values present in the data (depending on the implementation)

Normalising of data is not required as it uses a rule-based approach.
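The point about normalisation can be checked empirically: tree splits depend only on the ordering of feature values, so standardizing the features should leave the predictions unchanged. This is a sketch on synthetic stand-in data (not the emission dataset), with the same hyperparameters and seed for both models.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for four categorical features (not the real data).
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 8, size=(400, 4))
y_demo = np.digitize(X_demo[:, 3], [1, 3, 5])  # 4 classes driven by one feature

X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Same hyperparameters and seed; only the feature scale differs.
rf_raw = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)
rf_scaled = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo_scaled, y_demo)

# Share of samples on which both forests predict the same class.
agreement = np.mean(rf_raw.predict(X_demo) == rf_scaled.predict(X_demo_scaled))
print(f"Share of identical predictions: {agreement:.3f}")
```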


### Emission Zone (0, 1, 2, 3)

In [43]:
emissions_CH4_array = np.asarray(emissions_CH4_df)
emissions_CH4_array

array([[  2,   0,   1,   1,   7,   3],
       [  2,   1,   1,   1,   3,   2],
       [  2,   2,   1,   1,   3,   2],
       ...,
       [181,   3,   9,   1,   0,   0],
       [181,   5,   9,   1,   4,   2],
       [181,   6,   9,   1,   0,   0]], dtype=int64)

In [44]:
X = emissions_CH4_array[:,1:5]
X

array([[0, 1, 1, 7],
       [1, 1, 1, 3],
       [2, 1, 1, 3],
       ...,
       [3, 9, 1, 0],
       [5, 9, 1, 4],
       [6, 9, 1, 0]], dtype=int64)

In [45]:
y = emissions_CH4_array[:,5:6]
y

array([[3],
       [2],
       [2],
       ...,
       [0],
       [2],
       [0]], dtype=int64)

In [46]:
np.shape(X)

(5034, 4)

In [47]:
np.shape(y)

(5034, 1)

In [49]:
test_sizes = 0.20
seed = 1
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size = test_sizes, random_state =seed,stratify=y)
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

(4027, 4)
(1007, 4)
(4027, 1)
(1007, 1)


In [50]:
Emission_model = RandomForestClassifier(max_depth=2, random_state=0)

In [51]:
Emission_model_fit = Emission_model.fit(X_train, Y_train)

In [52]:
model_prediction = Emission_model_fit.predict(X_test)

In [53]:
Emission_model_fit.score(X_test, Y_test)

0.7894736842105263

In [55]:
model_prediction

array([0, 3, 0, ..., 0, 2, 2], dtype=int64)

In [56]:
# Use a separate name so the imported confusion_matrix function is not overwritten
rf_confusion_matrix = confusion_matrix(Y_test, model_prediction)
print(rf_confusion_matrix)

[[355   0   0   0]
 [  0  38 102  13]
 [  0  15  84  67]
 [  0   0  15 318]]


In [57]:
print(classification_report(Y_test, model_prediction))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       355
           1       0.72      0.25      0.37       153
           2       0.42      0.51      0.46       166
           3       0.80      0.95      0.87       333

    accuracy                           0.79      1007
   macro avg       0.73      0.68      0.67      1007
weighted avg       0.79      0.79      0.77      1007



## Trying Scaler and n_estimators

In [58]:
# Splitting into Train and Test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=78)

In [59]:
# Creating a StandardScaler instance.
scaler = StandardScaler()
# Fitting the Standard Scaler with the training data.
X_scaler = scaler.fit(X_train)

# Scaling the data.
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)

In [60]:
X_train_scaled

array([[ 2.45703884, -1.17247264, -0.32610237, -1.11736756],
       [ 0.20727615, -0.78336634, -0.32610237, -1.11736756],
       [ 1.55713376, -0.39426003, -0.32610237,  1.3653061 ],
       ...,
       [-0.69262893, -0.00515373,  0.65793955,  1.01063843],
       [-0.69262893,  1.16216519,  2.62602337,  1.3653061 ],
       [ 0.65722869, -1.56157895,  0.65793955, -0.7626999 ]])

In [61]:
# Create a random forest classifier.
rf_model = RandomForestClassifier(n_estimators=128, random_state=78) 

In [62]:
# Fitting the model
rf_model = rf_model.fit(X_train_scaled, y_train)

In [63]:
# Making predictions using the testing data.
predictions_s = rf_model.predict(X_test_scaled)

In [64]:
predictions_s

array([0, 3, 0, ..., 3, 0, 0], dtype=int64)

In [65]:
y_test

array([[0],
       [3],
       [0],
       ...,
       [3],
       [0],
       [0]], dtype=int64)

In [66]:
# Calculating the accuracy score
acc_score = accuracy_score(y_test, predictions_s)

In [67]:
acc_score

0.9189833200953137
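Note that this run differs from the earlier random forest in more than scaling: the first model used `max_depth=2`, while this one uses the default depth with `n_estimators=128` and a different train/test split, so the gain may not come from the scaler. The scaling and fitting steps can also be bundled in a scikit-learn `Pipeline`, which fits the scaler on the training split only. A minimal sketch on synthetic stand-in data (not the emission dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: four integer features, four classes.
rng = np.random.default_rng(78)
X_demo = rng.integers(0, 8, size=(500, 4))
y_demo = np.digitize(X_demo[:, 3], [1, 3, 5])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=78, stratify=y_demo)

# The pipeline fits the scaler on the training split, then the forest.
rf_pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=128, random_state=78))
rf_pipeline.fit(X_tr, y_tr)

pipeline_accuracy = rf_pipeline.score(X_te, y_te)
print(f"Test accuracy on synthetic data: {pipeline_accuracy:.3f}")
```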

## Predicting Zones for New Data from Outside the File

In [68]:
new_data = [(2, 8, 4, 6) , (9, 9, 1, 1)]

In [69]:
new_array = np.asarray(new_data)

In [70]:
labels =["Green", "Yellow", "Orange", "Red"]

### Logistic Regression Classifier

In [71]:
new_predicts = model.predict(new_array)

In [73]:
for i in range(2):
    print(new_data[i], labels[int(new_predicts[i])])

(2, 8, 4, 6) Red
(9, 9, 1, 1) Yellow


In [74]:
new_predicts

array([3, 1])

### Random Forest Classifier

In [75]:
new_predict_rf = Emission_model.predict(new_array)

In [76]:
for i in range(2):
    print(new_data[i], labels[int(new_predict_rf[i])])

(2, 8, 4, 6) Red
(9, 9, 1, 1) Orange


In [77]:
new_predict_rf

array([3, 2], dtype=int64)
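Before trusting predictions on new samples, it may help to check that each value lies inside the category range seen during training. The sketch below assumes the ranges shown by `describe()` on the encoded frame (Item_Code 0 to 8, Year 1 to 9, Population 0 to 4, Emission 0 to 7); the names `feature_names` and `valid_ranges` are illustrative.

```python
import pandas as pd

feature_names = ["Item_Code", "Year", "Population", "Emission"]

# Assumed category ranges, taken from the describe() output of the encoded frame.
valid_ranges = {"Item_Code": (0, 8), "Year": (1, 9),
                "Population": (0, 4), "Emission": (0, 7)}

new_data = [(2, 8, 4, 6), (9, 9, 1, 1)]
new_df = pd.DataFrame(new_data, columns=feature_names)

# Flag rows whose values all fall inside the ranges seen during training.
in_range = pd.DataFrame({
    name: new_df[name].between(low, high)
    for name, (low, high) in valid_ranges.items()
}).all(axis=1)
print(in_range.tolist())
```

The second sample has `Item_Code` 9, which is outside the 0 to 8 range, so the two classifiers are extrapolating for it; that may be part of why they disagree (Yellow versus Orange).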