# Getting Started: Market Research
This Jupyter notebook is a quick demonstration of how to get started with the market research section.

## 1) Holistic Regression (all features)


In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

train_data = pd.read_csv('./data/train.csv')
test_data = pd.read_csv('./data/test.csv')

#independent and dependent variables
x = train_data[['A','B','C','D','E','F','G','H','I','J','K','L','M','N']]
y = train_data[['Y1','Y2']]


#splitting data into train and test
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2, random_state=42)

#running regression
model = LinearRegression()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

#checking model performance
print("MSE: ",mean_squared_error(y_test, y_pred))
print("R-Squared: ", r2_score(y_test, y_pred))
print("Coefficients: ", model.coef_)
print("Intercept: ", model.intercept_)



MSE:  0.2610423252491223
R-Squared:  0.7094833926126334
Coefficients:  [[ 0.00664259 -0.01308609  0.0822096   0.00569229  0.07101493  0.0005716
   0.41602123  0.13716385 -0.00059139  0.13001162 -0.01014076 -0.00247624
   0.08376823  0.03463404]
 [ 0.22750768  0.13022678  0.003089    0.17574207 -0.00098952  0.1216997
   0.01769992 -0.00551329  0.03399319 -0.01844861  0.18472468  0.09821982
  -0.02112275 -0.00258247]]
Intercept:  [-0.00179489 -0.09241293]
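
Because `y` holds both Y1 and Y2, the MSE and R-Squared printed above are single numbers averaged over the two targets (scikit-learn's default is `multioutput='uniform_average'`). The sketch below shows one way to see each target separately by passing `multioutput='raw_values'`. It runs on synthetic stand-in data, so the column relationships and the numbers it prints are illustrative assumptions, not the challenge data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for train.csv (hypothetical data, not the challenge data)
rng = np.random.default_rng(0)
features = list("ABCDEFGHIJKLMN")
X = pd.DataFrame(rng.normal(size=(500, len(features))), columns=features)
Y = pd.DataFrame({
    "Y1": 0.4 * X["G"] + 0.1 * X["C"] + rng.normal(scale=0.5, size=500),
    "Y2": 0.2 * X["A"] + 0.2 * X["K"] + rng.normal(scale=0.5, size=500),
})

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
model = LinearRegression().fit(x_train, y_train)
y_pred = model.predict(x_test)

# One score per target column instead of a single averaged value
r2_per_target = r2_score(y_test, y_pred, multioutput="raw_values")
mse_per_target = mean_squared_error(y_test, y_pred, multioutput="raw_values")
for name, r2, mse in zip(Y.columns, r2_per_target, mse_per_target):
    print(f"{name}: R-Squared = {r2:.4f}, MSE = {mse:.4f}")
```

Per-target scores also make it fairer to compare this model with a single-target model, since an averaged R-Squared and a Y1-only R-Squared measure different things.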


## 2) Y1 Regression (correlated features only)

In [11]:
#y1 only training data
y1_x = train_data[['G','J','H','C','M','E','N']]
y1_y = train_data['Y1']


#splitting data into train and test
x_train, x_test, y_train, y_test = train_test_split(y1_x,y1_y,test_size=0.2, random_state=42)

#running regression
model = LinearRegression()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

#checking model performance
print("MSE: ",mean_squared_error(y_test, y_pred))
print("R-Squared: ", r2_score(y_test, y_pred))
print("Coefficients: ", model.coef_)
print("Intercept: ", model.intercept_)

MSE:  0.20263267251498907
R-Squared:  0.7775582458195478
Coefficients:  [0.41561241 0.12952665 0.13842676 0.08192792 0.08426852 0.07231967
 0.03516167]
Intercept:  -0.001130917732918514
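
The score above comes from a single 80/20 split, so it depends on which rows happen to land in the held-out part. A common way to get a steadier estimate is k-fold cross-validation. The sketch below uses `cross_val_score` with five shuffled folds on synthetic stand-in data. The data, the fold count, and the seed are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the Y1 feature subset (hypothetical data)
rng = np.random.default_rng(0)
y1_features = ["G", "J", "H", "C", "M", "E", "N"]
X = pd.DataFrame(rng.normal(size=(500, len(y1_features))), columns=y1_features)
y = 0.4 * X["G"] + 0.1 * X["J"] + rng.normal(scale=0.5, size=500)

# Five shuffled folds; each fold is used once as the held-out set
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print("R-Squared per fold:", np.round(scores, 4))
print(f"Mean: {scores.mean():.4f}, Std: {scores.std():.4f}")
```

A large spread across folds suggests that a single-split score should not be trusted too much.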


In [4]:
# Calculate correlation between C and Y1
correlation = train_data['C'].corr(train_data['Y1'])
print(f"Correlation between C and Y1: {correlation:.4f}")

Correlation between C and Y1: 0.7038


A correlation of about 0.70 indicates a strong linear relationship between C and Y1, so C is a natural feature to include when predicting Y1!
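
The notebook does not show how the "correlated features" in section 2 were chosen. One plausible approach, sketched here on synthetic stand-in data, is to compute every feature's correlation with Y1 using `DataFrame.corrwith` and keep the ones above a cutoff. The `threshold` value and the data are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for train.csv (hypothetical data, not the challenge data)
rng = np.random.default_rng(0)
features = list("ABCDEFGHIJKLMN")
train = pd.DataFrame(rng.normal(size=(1000, len(features))), columns=features)
train["Y1"] = 0.5 * train["G"] + 0.3 * train["C"] + rng.normal(scale=0.5, size=1000)

# Pearson correlation of each feature column with Y1
corr_with_y1 = train[features].corrwith(train["Y1"])

# Rank by absolute value so strong negative correlations are kept too
ranked = corr_with_y1.abs().sort_values(ascending=False)

threshold = 0.1  # hypothetical cutoff, tune for your own data
selected = ranked[ranked > threshold].index.tolist()

print(ranked.head())
print("Selected features:", selected)
```

Correlation only captures linear, one-feature-at-a-time relationships, so this is a starting filter rather than a complete feature selection method.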

## 3) Submit Predictions
To submit predictions, we need to make a CSV file with three columns: id, Y1, and Y2. In the example below, we set every prediction of Y1 and Y2 to the mean of Y1 and Y2 in the train set.

In [None]:
preds = test_data[['id']].copy()  # copy so the new columns are set on an independent DataFrame
preds['Y1'] = train_data['Y1'].mean()
preds['Y2'] = train_data['Y2'].mean()
preds


| | id | Y1 | Y2 |
|---|---|---|---|
| 0 | 1 | -0.002807 | -0.061172 |
| 1 | 2 | -0.002807 | -0.061172 |
| 2 | 3 | -0.002807 | -0.061172 |
| 3 | 4 | -0.002807 | -0.061172 |
| 4 | 5 | -0.002807 | -0.061172 |
| ... | ... | ... | ... |
| 15991 | 15992 | -0.002807 | -0.061172 |
| 15992 | 15993 | -0.002807 | -0.061172 |
| 15993 | 15994 | -0.002807 | -0.061172 |
| 15994 | 15995 | -0.002807 | -0.061172 |
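
The same three-column layout can be filled from a fitted model instead of constant means. The sketch below assumes the test file carries the same feature columns A to N as the train file, which this notebook does not confirm. It uses synthetic stand-in frames, and the name `model_preds` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
features = list("ABCDEFGHIJKLMN")

def make_frame(n_rows):
    """Build a synthetic frame with feature columns A..N (illustrative only)."""
    return pd.DataFrame(rng.normal(size=(n_rows, len(features))), columns=features)

# Synthetic stand-ins for train.csv and test.csv (hypothetical data)
train = make_frame(400)
train["Y1"] = 0.4 * train["G"] + rng.normal(scale=0.5, size=400)
train["Y2"] = 0.2 * train["A"] + rng.normal(scale=0.5, size=400)
test = make_frame(100)
test.insert(0, "id", np.arange(1, 101))

# Fit on all training rows, then predict both targets for the test rows
model = LinearRegression().fit(train[features], train[["Y1", "Y2"]])
predicted = model.predict(test[features])  # shape (n_rows, 2): columns Y1, Y2

# Assemble the id, Y1, Y2 layout expected for submission
model_preds = test[["id"]].copy()
model_preds["Y1"] = predicted[:, 0]
model_preds["Y2"] = predicted[:, 1]
print(model_preds.head())
```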


In [6]:
# save preds to csv
preds.to_csv('preds.csv', index=False)

You should now be able to submit preds.csv to [https://quantchallenge.org/dashboard/data/upload-predictions](https://quantchallenge.org/dashboard/data/upload-predictions)! With this set of predictions, you should receive a public $R^2$ score of $-0.042456$. Try to get the highest possible $R^2$ score over the next few days, but be careful not to overfit to the public score: it is calculated on only a subset of the test data, and the final score that counts is the private $R^2$ score!
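
A slightly negative score is what a constant prediction tends to produce. Assuming the score follows the standard definition $R^2 = 1 - SS_{res}/SS_{tot}$, the baseline it compares against is the mean of the evaluated targets themselves. Predicting the train mean therefore scores exactly zero only when it matches the test mean, and below zero otherwise. The sketch below illustrates this with made-up numbers and a hand-written `r_squared` helper (hypothetical, not the challenge's scoring code).

```python
def r_squared(y_true, y_pred):
    """Standard coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up targets: the train mean is about 0.0, the test mean is about 0.2
train_targets = [0.1, -0.3, 0.4, 0.0, -0.2]
test_targets = [0.5, -0.1, 0.3, 0.1]

train_mean = sum(train_targets) / len(train_targets)
test_mean = sum(test_targets) / len(test_targets)

# Constant prediction using the train mean (what the baseline submission does)
score_train_mean = r_squared(test_targets, [train_mean] * len(test_targets))

# Constant prediction using the test mean (not available in practice)
score_test_mean = r_squared(test_targets, [test_mean] * len(test_targets))

print(f"R^2 predicting the train mean: {score_train_mean:.4f}")  # negative
print(f"R^2 predicting the test mean:  {score_test_mean:.4f}")   # approximately zero
```

Any model that scores above zero on held-out data is therefore doing better than the constant baseline.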