# Getting Started: Market Research
This Jupyter notebook is a quick demonstration of how to get started with the market research section.

## 1) Holistic Regression (all features)


In [21]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

train_data = pd.read_csv('./data/train_new.csv')
test_data = pd.read_csv('./data/test_new.csv')

# Independent variables (features) and dependent variables (targets)
feature_cols = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N']
x = train_data[feature_cols]
y = train_data[['Y1','Y2']]

# Hold out 20% of the training data for validation
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Fit one linear model that predicts Y1 and Y2 jointly
model = LinearRegression()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

# Check model performance on the held-out split
print("MSE: ", mean_squared_error(y_test, y_pred))
print("R-Squared: ", r2_score(y_test, y_pred))
print("Coefficients: ", model.coef_)
print("Intercept: ", model.intercept_)

# Predict on the competition test set and write the submission file
actual_data = test_data[feature_cols]
actual_preds = model.predict(actual_data)
print(actual_preds)
preds = test_data[['id']].copy()  # copy so the new columns are not written to a slice of test_data
preds['Y1'] = actual_preds[:, 0]
preds['Y2'] = actual_preds[:, 1]
preds.to_csv('preds.csv', index=False)
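The holistic model predicts Y1 and Y2 jointly, and `r2_score` by default reports the unweighted average over both targets. That average can hide one target fitting well and the other poorly. The sketch below shows how per-target scores might be inspected with `multioutput="raw_values"`. It uses synthetic data as a stand-in for `train_new.csv` (an assumption, so the numbers it prints say nothing about the real dataset).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training data (assumption: 14 features, 2 targets).
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 14))
true_coefs = rng.normal(size=(14, 2))
Y = X @ true_coefs + rng.normal(scale=0.5, size=(500, 2))

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, Y_train)
Y_pred = model.predict(X_test)

# One R^2 per target column instead of a single averaged number.
per_target_r2 = r2_score(Y_test, Y_pred, multioutput="raw_values")
# The default is the unweighted mean of the per-target scores.
average_r2 = r2_score(Y_test, Y_pred)

print("R-squared per target:", per_target_r2)
print("Average R-squared:", average_r2)
```

With the real data, the same call on `y_test` and `y_pred` from the cell above would return one score for Y1 and one for Y2.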

## 2) Y1 Regression (correlated features only)

In [11]:
# Y1-only training data: restrict the features to those correlated with Y1
y1_x = train_data[['G','J','H','C','M','E','N']]
y1_y = train_data['Y1']


#splitting data into train and test
x_train, x_test, y_train, y_test = train_test_split(y1_x,y1_y,test_size=0.2, random_state=42)

#running regression
model = LinearRegression()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

#checking model performance
print("MSE: ",mean_squared_error(y_test, y_pred))
print("R-Squared: ", r2_score(y_test, y_pred))
print("Coefficients: ", model.coef_)
print("Intercept: ", model.intercept_)

MSE:  0.20263267251498907
R-Squared:  0.7775582458195478
Coefficients:  [0.41561241 0.12952665 0.13842676 0.08192792 0.08426852 0.07231967
 0.03516167]
Intercept:  -0.001130917732918514


In [4]:
# Calculate correlation between C and Y1
correlation = train_data['C'].corr(train_data['Y1'])
print(f"Correlation between C and Y1: {correlation:.4f}")

Correlation between C and Y1: 0.7038


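A correlation of about 0.70 indicates a strong positive linear relationship between C and Y1, so C is a good candidate feature for predicting Y1. Note, however, that C's coefficient in the multi-feature regression above is only about 0.08, which may mean that part of its signal overlaps with other features.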
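The single-feature check above can be extended to every feature at once. The following is a minimal sketch, assuming a small synthetic DataFrame in place of `train_data`; the column names and the way the target is generated are hypothetical and chosen only for illustration. It ranks features by the absolute value of their Pearson correlation with the target using `DataFrame.corrwith`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for train_data (assumption: four features, one target).
rng = np.random.default_rng(0)
n = 400
features = pd.DataFrame(rng.normal(size=(n, 4)), columns=["A", "B", "C", "D"])

# Hypothetical target driven mostly by C and slightly by A, plus noise.
target = 0.9 * features["C"] + 0.3 * features["A"] + rng.normal(scale=0.5, size=n)
target.name = "Y1"

# Pearson correlation of every feature column with the target.
correlations = features.corrwith(target)

# Rank by absolute value so strong negative correlations are not overlooked.
ranked = correlations.abs().sort_values(ascending=False)
print(ranked)
```

On the real data, `train_data[feature_list].corrwith(train_data['Y1'])` would give the equivalent ranking, where `feature_list` is whichever set of columns is being considered. Correlation only captures linear, one-feature-at-a-time relationships, so a low-ranked feature can still be useful in combination with others.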

## 3) Random Forest (both targets)
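The notebook gives no code for this step, so the following is only a sketch of what a random forest fit on both targets might look like. It assumes scikit-learn's `RandomForestRegressor`, which accepts a two-column target directly, and it uses synthetic nonlinear data in place of `train_new.csv`. The hyperparameters are untuned defaults.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training data (assumption: 14 features, 2 targets).
rng = np.random.default_rng(42)
X = rng.normal(size=(600, 14))
# Hypothetical nonlinear targets that a linear model would fit poorly.
Y1 = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=600)
Y2 = X[:, 2] * X[:, 3] + rng.normal(scale=0.1, size=600)
Y = np.column_stack([Y1, Y2])

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)

# A single forest predicts both target columns at once.
forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(X_train, Y_train)
Y_pred = forest.predict(X_test)

print("R-squared per target:", r2_score(Y_test, Y_pred, multioutput="raw_values"))
```

This sketch leaves `preds.csv` untouched. To try it on the real data, the `x_train`, `y_train`, and `x_test` frames from Section 1 could be passed in place of the synthetic arrays.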

You should now be able to submit the `preds.csv` file written in Section 1 to [https://quantchallenge.org/dashboard/data/upload-predictions](https://quantchallenge.org/dashboard/data/upload-predictions)! With this set of predictions you should receive a public $R^2$ score of $-0.042456$. Over the next few days, try to get the highest possible $R^2$ score. Be careful not to overfit to the public score, which is calculated on only a subset of the test data; the final score that counts is the private $R^2$ score!
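One common way to reduce the risk of tuning to the public score is to estimate out-of-sample $R^2$ locally with k-fold cross-validation and to rely on that estimate when comparing models. The sketch below assumes synthetic data and a plain `LinearRegression`; the fold count of 5 is an arbitrary choice, not something the challenge prescribes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the training data (assumption: 7 features, 1 target).
rng = np.random.default_rng(7)
X = rng.normal(size=(300, 7))
y = X @ rng.normal(size=7) + rng.normal(scale=0.5, size=300)

# Shuffle before splitting so each fold is a random sample of the rows.
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Each score is the R^2 on one held-out fold after fitting on the other four.
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print("R-squared per fold:", scores)
print("Mean R-squared:", scores.mean())
```

A large gap between the cross-validated mean and the public score is a hint that something differs between the training and test data, or that the model has been tuned too closely to one of them. If the rows are ordered in time, a shuffled `KFold` may be optimistic, and a time-aware split would be a better fit.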