
In [2]:
import pandas as pd

df=pd.read_csv("https://data.heatonresearch.com/data/t81-558/jh-simple-dataset.csv",na_values=['NA','?'])

pd.set_option('display.max_columns',9)
pd.set_option('display.max_rows',5)

display(df)

Unnamed: 0,id,job,area,income,...,pop_dense,retail_dense,crime,product
0,1,vv,c,50876.0,...,0.885827,0.492126,0.071100,b
1,2,kd,c,60369.0,...,0.874016,0.342520,0.400809,c
...,...,...,...,...,...,...,...,...,...
1998,1999,qp,c,67949.0,...,0.909449,0.598425,0.117803,c
1999,2000,pe,c,61467.0,...,0.925197,0.539370,0.451973,c


The following observations can be made from the data above:

The target column is the one you seek to predict. There are several candidates here, but we will initially use product, which records the product each person bought.

There is an id column. It should not be fed into the neural network, because it is only a row identifier and carries no information useful for prediction.

Many of the fields are numeric and might not require any further processing.

The income column has some missing values.

There are three categorical columns: job, area, and product.

To begin with, we will convert the job codes into dummy variables.

In [3]:
dummies=pd.get_dummies(df['job'],prefix="job")
print(dummies.shape)
display(dummies)
dummies.columns

(2000, 33)


Unnamed: 0,job_11,job_al,job_am,job_ax,...,job_rn,job_sa,job_vv,job_zz
0,0,0,0,0,...,0,0,1,0
1,0,0,0,0,...,0,0,0,0
...,...,...,...,...,...,...,...,...,...
1998,0,0,0,0,...,0,0,0,0
1999,0,0,0,0,...,0,0,0,0


Index(['job_11', 'job_al', 'job_am', 'job_ax', 'job_bf', 'job_by', 'job_cv',
       'job_de', 'job_dz', 'job_e2', 'job_f8', 'job_gj', 'job_gv', 'job_kd',
       'job_ke', 'job_kl', 'job_kp', 'job_ks', 'job_kw', 'job_mm', 'job_nb',
       'job_nn', 'job_ob', 'job_pe', 'job_po', 'job_pq', 'job_pz', 'job_qp',
       'job_qw', 'job_rn', 'job_sa', 'job_vv', 'job_zz'],
      dtype='object')
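
As a small illustration of what get_dummies produces, the sketch below encodes a four-row toy column. The values in `jobs` are made up for the example, and `dtype=int` is passed only to force 0/1 output, since recent pandas versions may return booleans by default.

```python
import pandas as pd

# Hypothetical toy column standing in for df['job']
jobs = pd.Series(["vv", "kd", "vv", "pe"], name="job")

# One new column per distinct value, named <prefix>_<value>
toy_dummies = pd.get_dummies(jobs, prefix="job", dtype=int)
print(toy_dummies)

# Every row has exactly one 1, in the column matching its category
row_sums = toy_dummies.sum(axis=1).tolist()
print(row_sums)
```

The columns come out in sorted order of the category values, which is why job_11 appears first in the real output above.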

Next, we must merge these dummies back into the main data frame. We also drop the original "job" field, as it is now represented by the dummies.

In [4]:
df=pd.concat([df,dummies],axis=1)  # append the dummy columns to the data frame
df.drop('job',axis=1,inplace=True)  # drop the original job column
display(df)

Unnamed: 0,id,area,income,aspect,...,job_rn,job_sa,job_vv,job_zz
0,1,c,50876.0,13.100000,...,0,0,1,0
1,2,c,60369.0,18.625000,...,0,0,0,0
...,...,...,...,...,...,...,...,...,...
1998,1999,c,67949.0,5.733333,...,0,0,0,0
1999,2000,c,61467.0,16.891667,...,0,0,0,0


In [5]:
areadummies=pd.get_dummies(df['area'],prefix="area")
df=pd.concat([df,areadummies],axis=1)
df.drop('area',axis=1,inplace=True)

display(df)
areadummies.columns

Unnamed: 0,id,income,aspect,subscriptions,...,area_a,area_b,area_c,area_d
0,1,50876.0,13.100000,1,...,0,0,1,0
1,2,60369.0,18.625000,2,...,0,0,1,0
...,...,...,...,...,...,...,...,...,...
1998,1999,67949.0,5.733333,0,...,0,0,1,0
1999,2000,61467.0,16.891667,0,...,0,0,1,0


Index(['area_a', 'area_b', 'area_c', 'area_d'], dtype='object')
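
The job and area cells follow the same encode, concatenate, drop pattern. As a possible shortcut, get_dummies also accepts a whole data frame plus a `columns` list and performs all three steps in one call. The sketch below uses a hypothetical three-row frame (`toy`) rather than the real dataset.

```python
import pandas as pd

# Hypothetical miniature of the data frame, reusing the real column names
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "job": ["vv", "kd", "vv"],
    "area": ["c", "c", "a"],
    "income": [50876.0, 60369.0, 55126.0],
})

# Encode both categorical columns at once; the original columns are
# removed and each new column is prefixed with its source column name
encoded = pd.get_dummies(toy, columns=["job", "area"], dtype=int)
print(list(encoded.columns))
```

The untouched columns keep their position at the front, and the dummy columns are appended in the order given in `columns`.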

In [6]:
# fill in missing income values
med=df['income'].median()
df['income']=df['income'].fillna(med)
display(df)

Unnamed: 0,id,income,aspect,subscriptions,...,area_a,area_b,area_c,area_d
0,1,50876.0,13.100000,1,...,0,0,1,0
1,2,60369.0,18.625000,2,...,0,0,1,0
...,...,...,...,...,...,...,...,...,...
1998,1999,67949.0,5.733333,0,...,0,0,1,0
1999,2000,61467.0,16.891667,0,...,0,0,1,0
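
The effect of the median fill is hard to see in the truncated display, so here is a minimal sketch on a toy column. The four values in `income` are illustrative, not taken from the rows that are actually missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy incomes with one missing entry
income = pd.Series([50876.0, np.nan, 60369.0, 28595.0])

# median() skips NaN by default, so it uses only the three known values
med = income.median()
filled = income.fillna(med)

print(med)
print(filled.isna().sum())
```

The median is less sensitive to extreme incomes than the mean, which is one common reason to prefer it for this kind of fill.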


In [7]:
print(list(df.columns))

['id', 'income', 'aspect', 'subscriptions', 'dist_healthy', 'save_rate', 'dist_unhealthy', 'age', 'pop_dense', 'retail_dense', 'crime', 'product', 'job_11', 'job_al', 'job_am', 'job_ax', 'job_bf', 'job_by', 'job_cv', 'job_de', 'job_dz', 'job_e2', 'job_f8', 'job_gj', 'job_gv', 'job_kd', 'job_ke', 'job_kl', 'job_kp', 'job_ks', 'job_kw', 'job_mm', 'job_nb', 'job_nn', 'job_ob', 'job_pe', 'job_po', 'job_pq', 'job_pz', 'job_qp', 'job_qw', 'job_rn', 'job_sa', 'job_vv', 'job_zz', 'area_a', 'area_b', 'area_c', 'area_d']


In [8]:
x_columns=df.columns.drop('product').drop('id')
print(list(x_columns))

['income', 'aspect', 'subscriptions', 'dist_healthy', 'save_rate', 'dist_unhealthy', 'age', 'pop_dense', 'retail_dense', 'crime', 'job_11', 'job_al', 'job_am', 'job_ax', 'job_bf', 'job_by', 'job_cv', 'job_de', 'job_dz', 'job_e2', 'job_f8', 'job_gj', 'job_gv', 'job_kd', 'job_ke', 'job_kl', 'job_kp', 'job_ks', 'job_kw', 'job_mm', 'job_nb', 'job_nn', 'job_ob', 'job_pe', 'job_po', 'job_pq', 'job_pz', 'job_qp', 'job_qw', 'job_rn', 'job_sa', 'job_vv', 'job_zz', 'area_a', 'area_b', 'area_c', 'area_d']


In [11]:
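# build the feature matrix x and the one-hot target y for classification
x_columns=df.columns.drop('product').drop('id')
x=df[x_columns].values.astype(float)  # cast in case get_dummies produced bool columns
dummies=pd.get_dummies(df['product'])
products=dummies.columns  # class labels, in the same order as the columns of y
y=dummies.values.astype(int)  # one-hot rows as 0/1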
print(x)
print(y)

[[5.08760000e+04 1.31000000e+01 1.00000000e+00 ... 0.00000000e+00
  1.00000000e+00 0.00000000e+00]
 [6.03690000e+04 1.86250000e+01 2.00000000e+00 ... 0.00000000e+00
  1.00000000e+00 0.00000000e+00]
 [5.51260000e+04 3.47666667e+01 1.00000000e+00 ... 0.00000000e+00
  1.00000000e+00 0.00000000e+00]
 ...
 [2.85950000e+04 3.94250000e+01 3.00000000e+00 ... 0.00000000e+00
  0.00000000e+00 1.00000000e+00]
 [6.79490000e+04 5.73333333e+00 0.00000000e+00 ... 0.00000000e+00
  1.00000000e+00 0.00000000e+00]
 [6.14670000e+04 1.68916667e+01 0.00000000e+00 ... 0.00000000e+00
  1.00000000e+00 0.00000000e+00]]
[[0 1 0 ... 0 0 0]
 [0 0 1 ... 0 0 0]
 [0 1 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 1 0]
 [0 0 1 ... 0 0 0]
 [0 0 1 ... 0 0 0]]


The x and y values are now ready for a neural network. Make sure that you construct the network for a classification problem. Specifically:

Classification neural networks have one output neuron per class, so the output layer here has y.shape[1] neurons.

Classification neural networks should use the categorical_crossentropy loss and a softmax activation function on the output layer.
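
To make the pairing of softmax and categorical cross-entropy concrete, the following is a simplified single-sample sketch in NumPy. It is not how Keras implements these internally; the helper names `softmax` and `categorical_crossentropy` and the example `logits` are assumptions made for illustration.

```python
import numpy as np

def softmax(logits):
    # Subtracting the max improves numerical stability without
    # changing the resulting probabilities
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # y_true is a one-hot vector; clipping avoids log(0)
    return -np.sum(y_true * np.log(np.clip(y_pred, eps, 1.0)))

logits = np.array([2.0, 1.0, 0.1])   # hypothetical raw outputs for 3 classes
probs = softmax(logits)              # non-negative values that sum to 1
y_true = np.array([1.0, 0.0, 0.0])   # the true class is the first one
loss = categorical_crossentropy(y_true, probs)

print(probs, probs.sum(), loss)
```

Because y_true is one-hot, the loss reduces to the negative log of the probability assigned to the correct class, so it shrinks toward zero as that probability approaches 1.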

In [15]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
import pandas as pd
import io
import os
import requests
import numpy as np
from sklearn import metrics

In [None]:
model=Sequential()
model.add(Dense(25,input_dim=x.shape[1],activation='relu'))
model.add(Dense(10,activation='relu'))
model.add(Dense(y.shape[1],activation='softmax'))
model.compile(loss='categorical_crossentropy',optimizer='adam')
model.fit(x,y,verbose=1,epochs=100)


In [32]:
predval=model.predict(np.array([x[2]]))
predindex=np.argmax(predval)  # index of the highest-probability class
predclass=products[predindex]
print(f'predicted class:{predclass}')

predicted class:b
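
The same argmax lookup extends to several rows at once. The sketch below uses hand-written probabilities in place of model.predict output, and `labels` is a hypothetical stand-in for the products index.

```python
import numpy as np

# Hypothetical class labels and softmax outputs for three rows
labels = np.array(["a", "b", "c"])
pred = np.array([
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
])

# axis=1 takes the argmax within each row: one class index per sample
pred_idx = np.argmax(pred, axis=1)
pred_labels = labels[pred_idx]
print(pred_labels.tolist())
```

Without `axis=1`, argmax would flatten the array and return a single index, which only happens to work in the cell above because it predicts one row.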


In [33]:
# target for a regression neural network: the numeric column itself, no dummies
y=df['income'].values

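For a regression neural network, the x values are generated the same way, but y is not converted to dummies; it holds the numeric target directly. Make sure to replace income with your actual target. Note that x, as built earlier, still contains the income column, so the target should also be dropped from x_columns before training. Otherwise the network mostly learns to copy that input, which is why the prediction for row 1 further down is almost exactly that row's income.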
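
One way to build regression inputs that exclude the target is sketched below on a hypothetical three-row frame. The `age` and `crime` values are invented, and `target`, `x_cols`, `x_toy` and `y_toy` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical miniature of the data frame
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "income": [50876.0, 60369.0, 55126.0],
    "age": [60, 45, 33],
    "crime": [0.07, 0.40, 0.12],
    "product": ["b", "c", "b"],
})

target = "income"

# Drop the identifier, the unencoded product column, and the target itself
x_cols = toy.columns.drop(["id", "product", target])
x_toy = toy[x_cols].values.astype(float)
y_toy = toy[target].values

print(list(x_cols))
print(x_toy.shape, y_toy.shape)
```

Here product is simply dropped, mirroring the classification cell; it could instead be one-hot encoded and kept as an input feature.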

In [34]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
import io
import os
import numpy as np
import requests
from sklearn import metrics

In [None]:
model=Sequential()
model.add(Dense(25,input_dim=x.shape[1],activation='relu'))
model.add(Dense(10,activation='relu'))
model.add(Dense(1))  # single linear output neuron, no activation
model.compile(loss='mean_squared_error',optimizer='adam')
model.fit(x,y,verbose=1,epochs=50)

In [36]:
model.predict(np.array([x[1]]))

array([[60369.742]], dtype=float32)