# Linear Regression with Categorical Variables

---

For the linear regression model built from the "Ecom Expense.csv" dataset, answer the following:

1. What happens if we add the "Record" variable to the prediction model? Does it improve the model's goodness of fit?
2. Is there any variable that may be negatively affecting our model's predictions?

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

In [2]:
# Load the "Ecom Expense.csv" dataset
df = pd.read_csv("../../datasets/ecom-expense/Ecom Expense.csv")
df.head()

| | Transaction ID | Age | Items | Monthly Income | Transaction Time | Record | Gender | City Tier | Total Spend |
|---|---|---|---|---|---|---|---|---|---|
| 0 | TXN001 | 42 | 10 | 7313 | 627.668127 | 5 | Female | Tier 1 | 4198.385084 |
| 1 | TXN002 | 24 | 8 | 17747 | 126.904567 | 3 | Female | Tier 2 | 4134.976648 |
| 2 | TXN003 | 47 | 11 | 22845 | 873.469701 | 2 | Male | Tier 2 | 5166.614455 |
| 3 | TXN004 | 50 | 11 | 18552 | 380.219428 | 7 | Female | Tier 1 | 7784.447676 |
| 4 | TXN005 | 60 | 2 | 14439 | 403.374223 | 2 | Female | Tier 2 | 3254.160485 |


In [4]:
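# Encode the categorical columns as integers.
df["Gender"] = df["Gender"].map({"Female": 0, "Male": 1}).astype("Int8")
df["City Tier"] = df["City Tier"].map({"Tier 1": 0, "Tier 2": 1, "Tier 3": 2}).astype("Int8")
# Drop "Transaction ID" and "Age" so the predictors match the six columns shown in the outputs below.
df = df.iloc[:, 2:]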
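The cell above encodes "City Tier" as 0, 1, 2, which treats the tiers as evenly spaced steps on a numeric scale. A common alternative for categorical variables is one-hot (dummy) encoding. The sketch below uses a small hypothetical frame (`demo`) rather than the real dataset, and the choice of `drop_first=True` is an assumption made here to avoid perfect collinearity with the intercept.

```python
import pandas as pd

# Hypothetical mini-frame that mimics the two categorical columns of the dataset.
demo = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male"],
    "City Tier": ["Tier 1", "Tier 2", "Tier 3", "Tier 1"],
})

# One column per category level; drop_first=True removes the first level of each
# variable, which then acts as the baseline absorbed by the intercept.
dummies = pd.get_dummies(
    demo, columns=["Gender", "City Tier"], drop_first=True, dtype=int
)
print(dummies)
```

With this encoding each tier gets its own coefficient, measured relative to the dropped baseline level, instead of a single slope shared by all tiers.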

In [5]:
df.head()

| | Items | Monthly Income | Transaction Time | Record | Gender | City Tier | Total Spend |
|---|---|---|---|---|---|---|---|
| 0 | 10 | 7313 | 627.668127 | 5 | 0 | 0 | 4198.385084 |
| 1 | 8 | 17747 | 126.904567 | 3 | 0 | 1 | 4134.976648 |
| 2 | 11 | 22845 | 873.469701 | 2 | 1 | 1 | 5166.614455 |
| 3 | 11 | 18552 | 380.219428 | 7 | 0 | 0 | 7784.447676 |
| 4 | 2 | 14439 | 403.374223 | 2 | 0 | 1 | 3254.160485 |


In [6]:
X = df[df.columns[:-1]]
Y = df["Total Spend"]

In [7]:
lm = LinearRegression()
lm.fit(X,Y)

LinearRegression()


In [8]:
print(lm.intercept_)
print(lm.coef_)

-433.39279433600586
[ 3.96967861e+01  1.47806941e-01  1.67008341e-01  7.71256787e+02
  2.59339259e+02 -9.41091069e+01]


In [9]:
list(zip(X.columns, lm.coef_))

[(' Items ', np.float64(39.696786083182594)),
 ('Monthly Income', np.float64(0.14780694080312073)),
 ('Transaction Time', np.float64(0.16700834115295704)),
 ('Record', np.float64(771.2567870364602)),
 ('Gender', np.float64(259.33925918551523)),
 ('City Tier', np.float64(-94.10910693044877))]

In [10]:
lm.score(X,Y)

0.9214763195081017
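Both opening questions can be approached by refitting the model with one predictor left out and comparing goodness of fit. The sketch below is an illustration on synthetic data, not the notebook's result: the frame `toy`, its `Noise` column, and the helpers `adjusted_r2` and `fit_scores` are all hypothetical names introduced here. Adjusted R² is included because plain in-sample R² can never decrease when a predictor is added, so it cannot flag an unhelpful variable on its own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2: penalizes R^2 for the number of predictors used."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


def fit_scores(frame, predictors, target):
    """Fit OLS on the given predictors and return (R^2, adjusted R^2), both in-sample."""
    model = LinearRegression().fit(frame[predictors], frame[target])
    r2 = model.score(frame[predictors], frame[target])
    return r2, adjusted_r2(r2, len(frame), len(predictors))


# Synthetic stand-in data (assumption): "Record" drives the target, "Noise" is irrelevant.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Items": rng.integers(1, 15, n),
    "Record": rng.integers(1, 11, n),
    "Noise": rng.normal(size=n),
})
toy["Total Spend"] = 40 * toy["Items"] + 770 * toy["Record"] + rng.normal(0, 500, n)

all_predictors = ["Items", "Record", "Noise"]
r2_full, adj_full = fit_scores(toy, all_predictors, "Total Spend")
print(f"all predictors: R2={r2_full:.4f}, adjusted R2={adj_full:.4f}")

# Leave one predictor out at a time and record the resulting scores.
scores_without = {}
for dropped in all_predictors:
    kept = [c for c in all_predictors if c != dropped]
    scores_without[dropped] = fit_scores(toy, kept, "Total Spend")
    r2, adj = scores_without[dropped]
    print(f"without {dropped!r}: R2={r2:.4f}, adjusted R2={adj:.4f}")
```

Applied to the real data, the same loop over `X.columns` would show how much each variable, "Record" included, contributes to the fit. A predictor whose removal leaves adjusted R² unchanged or higher is a candidate for exclusion.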

In [11]:
y_pred = lm.predict(X)
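Here `lm.score(X, Y)` and `lm.predict(X)` use the same rows the model was fitted on, so the reported R² is an in-sample figure. A hold-out split gives a less optimistic estimate. The following is a minimal sketch on synthetic arrays (`X_demo` and `y_demo` are hypothetical stand-ins for `X` and `Y`), and the 80/20 split is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): a linear signal plus Gaussian noise.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=500)

# Hold out 20% of the rows; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
r2_train = model.score(X_train, y_train)
r2_test = model.score(X_test, y_test)
print(f"train R2={r2_train:.4f}, test R2={r2_test:.4f}")
```

A test R² far below the training R² would suggest the model is fitting noise in the training rows.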

In [12]:
df["Y_predicted"] = y_pred
df.head(10)

| | Items | Monthly Income | Transaction Time | Record | Gender | City Tier | Total Spend | Y_predicted |
|---|---|---|---|---|---|---|---|---|
| 0 | 10 | 7313 | 627.668127 | 5 | 0 | 0 | 4198.385084 | 5005.596972 |
| 1 | 8 | 17747 | 126.904567 | 3 | 0 | 1 | 4134.976648 | 4748.166648 |
| 2 | 11 | 22845 | 873.469701 | 2 | 1 | 1 | 5166.614455 | 5233.541867 |
| 3 | 11 | 18552 | 380.219428 | 7 | 0 | 0 | 7784.447676 | 8207.683544 |
| 4 | 2 | 14439 | 403.374223 | 2 | 0 | 1 | 3254.160485 | 3295.956523 |
| 5 | 6 | 6282 | 48.974268 | 2 | 1 | 1 | 2375.036467 | 2449.233962 |
| 6 | 14 | 7086 | 961.203768 | 8 | 1 | 0 | 7494.474559 | 7759.644796 |
| 7 | 9 | 8881 | 962.25374 | 10 | 1 | 2 | 10782.94492 | 9180.945038 |
| 8 | 6 | 5635 | 858.328132 | 5 | 1 | 0 | 3854.277411 | 4896.651185 |
| 9 | 12 | 20861 | 43.036737 | 4 | 0 | 1 | 5346.140262 | 6124.474766 |
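Each value in `Y_predicted` is the intercept plus the sum of every coefficient times its feature value. As a check, the sketch below recomputes the prediction for row 0 by hand, using the intercept and coefficients printed earlier and the row's feature values as displayed (rounded to six decimals). The variable names are introduced here for illustration only.

```python
# Intercept and coefficients copied from the outputs of lm.intercept_ and lm.coef_.
intercept = -433.39279433600586
coefficients = {
    " Items ": 39.696786083182594,
    "Monthly Income": 0.14780694080312073,
    "Transaction Time": 0.16700834115295704,
    "Record": 771.2567870364602,
    "Gender": 259.33925918551523,
    "City Tier": -94.10910693044877,
}

# Feature values of row 0 as displayed in the table.
row0 = {
    " Items ": 10,
    "Monthly Income": 7313,
    "Transaction Time": 627.668127,
    "Record": 5,
    "Gender": 0,
    "City Tier": 0,
}

# Linear model: y_hat = intercept + sum(coef_j * x_j)
manual_prediction = intercept + sum(
    coefficients[name] * row0[name] for name in coefficients
)
print(manual_prediction)  # agrees with Y_predicted for row 0 (5005.596972) up to rounding
```

The "Record" term alone contributes 771.2567870364602 × 5 ≈ 3856.28 of that total, which shows how strongly this variable drives the prediction.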
