## Building a classification model for the SPX dataset


In this Jupyter notebook, we build a classification model for the SPX dataset using the random forest algorithm.




### 1. Import libraries

In [1]:
# Only the imports this notebook actually uses
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

### 2. Load the dataset

In [2]:
# Load the Excel training file with pandas (reading .xlsx files requires the openpyxl package)
data = pd.read_excel("https://raw.githubusercontent.com/vsipra/ml-pipeline-spx/main/SPX_train_0.xlsx")

# The dataset has 899 rows and 16 columns

### 3. Input Features

The SPX dataset contains 14 input features and 1 output variable (the class label, `Returns`). The remaining column, `Time`, is dropped and not used as a feature.

In [3]:
X = data.drop(columns=["Returns", "Time"])   # Feature matrix (columns= already implies axis=1)
Y = data["Returns"]                          # Target variable

In [4]:
X.shape

(899, 14)

In [5]:
Y.shape

(899,)
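The feature/target separation above can be sketched on a small stand-in table. The toy values below are made up for illustration and are not taken from the SPX file; only the column names `Time`, `dp`, `dy`, and `Returns` come from the notebook. The sketch also checks the class balance of the target, which is useful context when reading an accuracy score later on.

```python
import pandas as pd

# Toy stand-in for the SPX frame (values are invented for illustration).
toy = pd.DataFrame({
    "Time": [1, 2, 3, 4],
    "dp": [-3.04, -3.10, -3.04, -3.13],
    "dy": [-3.03, -3.04, -3.09, -3.04],
    "Returns": [1, 0, 1, 1],
})

# Drop the target and the time index to obtain the feature matrix.
X_toy = toy.drop(columns=["Returns", "Time"])
y_toy = toy["Returns"]

# Share of each class label in the target.
class_share = y_toy.value_counts(normalize=True)
print(X_toy.shape, y_toy.shape)
print(class_share)
```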

In [6]:
print(X)

           dp        dy        ep        de      svar        bm      ntis  \
0   -3.041609 -3.027403 -2.662340 -0.379269  0.000924  0.735342  0.016454   
1   -3.096132 -3.036338 -2.711553 -0.384579  0.000655  0.704489  0.014836   
2   -3.043790 -3.091042 -2.653829 -0.389961  0.001887  0.767883  0.015963   
3   -3.128109 -3.043790 -2.724389 -0.403720  0.001398  0.715063  0.015086   
4   -3.139500 -3.128109 -2.722106 -0.417394  0.000921  0.702911  0.019773   
..        ...       ...       ...       ...       ...       ...       ...   
894 -3.966309 -3.953266 -3.098391 -0.867918  0.000594  0.233834 -0.012703   
895 -3.941330 -3.959587 -3.086025 -0.855304  0.004318  0.237917 -0.010244   
896 -3.951689 -3.934654 -3.108987 -0.842702  0.000605  0.233377 -0.010959   
897 -3.965984 -3.945758 -3.112869 -0.853115  0.001510  0.232261 -0.013267   
898 -3.993568 -3.960087 -3.130267 -0.863300  0.000306  0.223938 -0.007907   

        tbl     lty     ltr     tms     dfy     dfr      infl  
0    0.0038

### 4. Build classification model using Random Forest

In [7]:
clf = RandomForestClassifier()

In [8]:
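# Not needed: "Time" was already dropped when X was built in step 3.
# Note that drop(..., inplace=True) returns None, so assigning its result to X would discard the data.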

In [9]:
clf.fit(X.values, Y)

RandomForestClassifier()

### 5. Feature Importance

In [10]:
print(clf.feature_importances_)

[0.07274592 0.07223183 0.07275173 0.06978757 0.07297904 0.06883099
 0.08194636 0.06643644 0.06718442 0.07169159 0.06722611 0.06231075
 0.07299211 0.08088513]
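The raw array is hard to read because the values are not labelled. A minimal sketch of one way to pair each importance with its column name and sort the result is shown below. It trains on synthetic data from `make_classification` (an assumption made so the example runs on its own); the column names are the ones printed by the notebook, and `demo_clf` is a hypothetical stand-in for `clf`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Column names as printed in the notebook; the data itself is synthetic.
feature_names = ["dp", "dy", "ep", "de", "svar", "bm", "ntis",
                 "tbl", "lty", "ltr", "tms", "dfy", "dfr", "infl"]
X_demo, y_demo = make_classification(n_samples=300, n_features=14, random_state=0)
X_demo = pd.DataFrame(X_demo, columns=feature_names)

# Hypothetical stand-in for the notebook's clf.
demo_clf = RandomForestClassifier(random_state=0).fit(X_demo.values, y_demo)

# Label each importance with its feature name and sort from most to least important.
importances = pd.Series(demo_clf.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

In the notebook itself, the same idea would presumably use `X.columns` as the index and `clf.feature_importances_` as the values.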


### 6. Data split (80/20 ratio)

In [11]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

In [12]:
X_train.shape, Y_train.shape

((719, 14), (719,))

In [13]:
X_test.shape, Y_test.shape

((180, 14), (180,))
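Because `train_test_split` shuffles randomly, the split above changes on every run, and so does the score reported later. A hedged sketch of a reproducible variant follows: it fixes `random_state` and uses `stratify` to keep the class proportions similar in both parts. The data is synthetic (899 rows and 14 features, matching the shapes above), and the seed value 42 is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data with the same shape as the notebook's X and Y.
X_demo, y_demo = make_classification(n_samples=899, n_features=14, random_state=0)

# Fixed seed for repeatability; stratify keeps class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(X_tr.shape, X_te.shape)
```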

In [14]:
for col in X:
    print(col)

dp
dy
ep
de
svar
bm
ntis
tbl
lty
ltr
tms
dfy
dfr
infl


### 7. Refit the Random Forest model on the training set

In [15]:
clf.fit(X_train.values, Y_train)

RandomForestClassifier()

### 8. Perform predictions on the test set

Predicted class labels

In [17]:
print(clf.predict(X_test.values))

[1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 0 1 1 1 1 0 1 0 0 1 1 0 1 1 1 1 1 0 0 1 0 1
 0 1 1 0 0 1 0 0 0 1 0 1 1 0 0 1 0 0 1 0 0 1 1 0 0 1 1 1 0 1 0 0 1 1 0 1 1
 1 1 0 1 1 1 1 0 1 1 1 1 1 0 0 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 0
 1 1 1 0 1 1 1 1 1 0 1 1 0 1 0 0 0 1 1 0 1 1 1 1 1 0 0 1 0 0 0 0 1 0 0 0 1
 0 1 1 0 1 1 0 1 0 0 0 0 0 1 1 1 1 1 0 0 1 1 1 1 1 0 1 0 0 0 0 1]


Actual class labels

In [18]:
print(Y_test)

842    1
808    1
810    1
820    0
5      0
      ..
398    1
432    1
82     1
379    1
566    1
Name: Returns, Length: 180, dtype: int64


### 9. Model Performance

In [20]:
print(clf.score(X_test.values, Y_test))

0.5666666666666667
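`clf.score` returns plain accuracy, which is difficult to judge without a reference point. One possible way to add context, sketched here on synthetic data rather than the SPX file, is to compare against a majority-class baseline and to inspect the confusion matrix. The names `model` and `baseline` are hypothetical, and the numbers this sketch prints say nothing about the SPX results.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: same shape as the notebook's data).
X_demo, y_demo = make_classification(n_samples=899, n_features=14, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
# Baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)

model_acc = accuracy_score(y_te, model.predict(X_te))
baseline_acc = accuracy_score(y_te, baseline.predict(X_te))

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_te, model.predict(X_te))

print("model accuracy:   ", model_acc)
print("baseline accuracy:", baseline_acc)
print(cm)
```

A model is only informative to the extent that its accuracy exceeds the baseline's on the same test set.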


In [None]:
# Serialize the trained model to a pickle file
import pickle

# The context manager closes the file even if pickle.dump raises an error
with open("C:\\Users\\vajih\\OneDrive\\Documents\\Learning\\ml-model-spx\\classifier.pkl", "wb") as pickle_out:
    pickle.dump(clf, pickle_out)
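To confirm that a pickled model can be restored, it can be loaded back and its predictions compared with the original. The sketch below is a self-contained round trip under stated assumptions: it writes to a temporary directory instead of the path used above, and `demo_clf` is a hypothetical stand-in trained on synthetic data.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in model trained on synthetic data.
X_demo, y_demo = make_classification(n_samples=200, n_features=14, random_state=0)
demo_clf = RandomForestClassifier(random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "classifier.pkl"

    # Write the model to disk.
    with open(model_path, "wb") as f:
        pickle.dump(demo_clf, f)

    # Read it back into a new object.
    with open(model_path, "rb") as f:
        restored_clf = pickle.load(f)

# The restored model should reproduce the original predictions exactly.
same_predictions = bool((restored_clf.predict(X_demo) == demo_clf.predict(X_demo)).all())
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, and a model is safest to restore with the same scikit-learn version that saved it.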

In [None]:
# Shell command: the leading "!" lets it run from a notebook cell
!python --version

In [22]:
# Predict the class of a single sample; the 14 values are arbitrary, one per feature
prediction = clf.predict([[2,3,4,1,2,3,4,1,2,3,4,1,1,1]])

In [None]:
print(X_test)

In [25]:
print(type(prediction))

<class 'numpy.ndarray'>
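Since `predict` returns a NumPy array even for a single sample, the label has to be read out by index. A short sketch of that, alongside `predict_proba` for the class probabilities, is given below. It again uses a hypothetical `demo_clf` fitted on synthetic data, with the same arbitrary 14-value sample as above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in model trained on synthetic data with 14 features.
X_demo, y_demo = make_classification(n_samples=200, n_features=14, random_state=0)
demo_clf = RandomForestClassifier(random_state=0).fit(X_demo, y_demo)

# One sample must still be 2-D: shape (1, 14).
sample = np.array([[2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 1, 1]])

label = demo_clf.predict(sample)        # array with one class label
proba = demo_clf.predict_proba(sample)  # one row of class probabilities

print(int(label[0]))
print(dict(zip(demo_clf.classes_.tolist(), proba[0].tolist())))
```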
