# Practical Data Analysis
- This course teaches data analysis by applying statistical techniques with Python libraries.

## Regression Analysis 1: Marketing Data Analysis (Simple Regression)

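- We will work through the data analysis exercises using short example code.
- First, we create a synthetic marketing dataset for analyzing the relationship between variables.
- Manufacturer A collects marketing data for each of its services. Among the variables it collects, we want to know how advertising affects sales.
    1. Advertising: ad exposure
    2. Sales: sales volume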

### Import the Required Libraries

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm

### Generate the Data

In [4]:
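# Build a DataFrame containing the marketing data
np.random.seed(0)

# Generate the Advertising data
advertising = np.random.normal(50, 10, 100)
# Dependent variable (the variable being affected)
# True intercept is 50.7 (= 50 + 0.7) and true slope is 0.3, plus Gaussian noise
sales = 50 + 0.3 * advertising + 0.7 + np.random.normal(0, 10, 100)

# Create the DataFrame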
data = pd.DataFrame({
    'Advertising': advertising,
    'Sales': sales,
})

### Set the Independent and Dependent Variables

In [3]:
# Set the independent and dependent variables
X = data['Advertising']  # ad exposure
y = data['Sales']  # sales volume

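### Add a Constant Term
- The constant term is the intercept of the regression model: the value of the dependent variable when the independent variable is 0. A regression model generally aims to predict and explain the dependent variable from the independent variables, and `sm.OLS` does not add an intercept on its own, so we add one explicitly with `sm.add_constant`.
- It is easier to think of this as adding a baseline for the model.
- Unless there is a specific reason to force the fitted line through the origin, include a constant term so that the relationship between the independent and dependent variables is modeled more accurately.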

In [4]:
# Add a constant term
X = sm.add_constant(X)

### Fit the Regression Equation

In [5]:
# Fit a simple linear regression model
model = sm.OLS(y, X)
results = model.fit()

### Print the Model Results

In [6]:
# Print the results
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                  Sales   R-squared:                       0.142
Model:                            OLS   Adj. R-squared:                  0.133
Method:                 Least Squares   F-statistic:                     16.19
Date:                Thu, 29 Jun 2023   Prob (F-statistic):           0.000113
Time:                        15:50:05   Log-Likelihood:                -374.93
No. Observations:                 100   AIC:                             753.9
Df Residuals:                      98   BIC:                             759.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const          45.7166      5.317      8.599      

### Predict

In [15]:
results.predict([1.0, 67.640523])

array([73.76702811])

In [17]:
0.4147 * 67.640523 + 45.7166

73.7671248881

## Regression Analysis 1 - Alternative Approach

In [3]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [5]:
np.random.seed(0)
advertising = np.random.normal(50, 10, 100)
sales = 50 + 0.3 * advertising + 0.7 + np.random.normal(0, 10, 100)

df = pd.DataFrame({
    'Advertising': advertising,
    'Sales': sales,
})

In [6]:
df.head(2)

| | Advertising | Sales |
|---|---|---|
| 0 | 67.640523 | 89.823664 |
| 1 | 54.001572 | 53.422881 |


In [7]:
model = ols(formula='Sales ~ Advertising', data=df).fit()

In [9]:
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                  Sales   R-squared:                       0.142
Model:                            OLS   Adj. R-squared:                  0.133
Method:                 Least Squares   F-statistic:                     16.19
Date:                Thu, 29 Jun 2023   Prob (F-statistic):           0.000113
Time:                        15:56:10   Log-Likelihood:                -374.93
No. Observations:                 100   AIC:                             753.9
Df Residuals:                      98   BIC:                             759.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept      45.7166      5.317      8.599      

In [16]:
xtrain = pd.DataFrame([67.640523], columns=['Advertising'])

In [17]:
model.predict(xtrain)

0    73.767028
dtype: float64

In [18]:
model.params

Intercept      45.716609
Advertising     0.414698
dtype: float64

In [20]:
pred = model.get_prediction(xtrain)

In [21]:
pred.summary_frame()

| | mean | mean_se | mean_ci_lower | mean_ci_upper | obs_ci_lower | obs_ci_upper |
|---|---|---|---|---|---|---|
| 0 | 73.767028 | 2.040367 | 69.717987 | 77.81607 | 52.761857 | 94.772199 |
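
The frame returned by `summary_frame()` holds two different intervals: `mean_ci_*` is the confidence interval for the average Sales at the given Advertising value, and `obs_ci_*` is the wider prediction interval for a single new observation. The sketch below refits the same model and also requests 90% intervals through the `alpha` argument; the names `new_x`, `frame_95`, and `frame_90` are illustrative.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Same synthetic data as in the cells above
np.random.seed(0)
advertising = np.random.normal(50, 10, 100)
sales = 50 + 0.3 * advertising + 0.7 + np.random.normal(0, 10, 100)
df = pd.DataFrame({'Advertising': advertising, 'Sales': sales})

model = ols(formula='Sales ~ Advertising', data=df).fit()

# One new Advertising value to predict for
new_x = pd.DataFrame({'Advertising': [67.640523]})
pred = model.get_prediction(new_x)

frame_95 = pred.summary_frame()            # alpha=0.05 by default (95% intervals)
frame_90 = pred.summary_frame(alpha=0.10)  # 90% intervals

# Interval widths: the prediction interval also covers observation noise
mean_width_95 = (frame_95['mean_ci_upper'] - frame_95['mean_ci_lower']).iloc[0]
obs_width_95 = (frame_95['obs_ci_upper'] - frame_95['obs_ci_lower']).iloc[0]
obs_width_90 = (frame_90['obs_ci_upper'] - frame_90['obs_ci_lower']).iloc[0]
print(mean_width_95, obs_width_95, obs_width_90)
```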


## Regression Analysis 2: Marketing Data Analysis (Multiple Regression)

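- We will work through the data analysis exercises using short example code.
- First, we create a synthetic marketing dataset for analyzing the relationship between variables.
- Manufacturer A collects marketing data for each of its services. Among the variables it collects, we want to know how advertising, web traffic, and the number of social media uploads affect sales.
    1. Advertising: ad exposure
    2. Sales: sales volume
    3. Website_Traffic: traffic to the service page on the website
    4. Social_Media: number of social media uploads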

### Import the Required Libraries

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

### Generate the Data

In [2]:
# Build a DataFrame containing the marketing data
np.random.seed(0)

# Generate the Advertising data
advertising = np.random.normal(50, 10, 100)
# Generate the Website_Traffic data
website_traffic = np.random.normal(1000, 200, 100)
# Generate the Social_Media data
social_media = np.random.normal(500, 100, 100)
# Dependent variable (the variable being affected)
sales = 50 + 0.3 * advertising + 0.7 * website_traffic + 0.2 * social_media + np.random.normal(0, 10, 100)

# Create the DataFrame
data = pd.DataFrame({
    'Advertising': advertising,
    'Sales': sales,
    'Website_Traffic': website_traffic,
    'Social_Media': social_media,
})

### Set the Independent and Dependent Variables

In [3]:
# Separate the independent and dependent variables
X = data[['Advertising', 'Website_Traffic', 'Social_Media']]
y = data['Sales']

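### Add a Constant Term
- The constant term is the intercept of the regression model: the value of the dependent variable when every independent variable is 0. A regression model generally aims to predict and explain the dependent variable from the independent variables, and `sm.OLS` does not add an intercept on its own, so we add one explicitly with `sm.add_constant`.
- It is easier to think of this as adding a baseline for the model.
- Unless there is a specific reason to force the fitted surface through the origin, include a constant term so that the relationship between the independent and dependent variables is modeled more accurately.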

In [4]:
# Add a constant term
X = sm.add_constant(X)

In [5]:
X

| | const | Advertising | Website_Traffic | Social_Media |
|---|---|---|---|---|
| 0 | 1.0 | 67.640523 | 1376.630139 | 463.081816 |
| 1 | 1.0 | 54.001572 | 730.448188 | 476.062082 |
| 2 | 1.0 | 59.787380 | 745.903000 | 609.965960 |
| 3 | 1.0 | 72.408932 | 1193.879342 | 565.526373 |
| 4 | 1.0 | 68.675580 | 765.375319 | 564.013153 |
| ... | ... | ... | ... | ... |
| 95 | 1.0 | 57.065732 | 965.690734 | 613.689136 |
| 96 | 1.0 | 50.105000 | 1154.358110 | 509.772497 |
| 97 | 1.0 | 67.858705 | 1164.700831 | 558.295368 |
| 98 | 1.0 | 51.269121 | 1432.647190 | 460.055097 |


### Fit the Regression Equation

In [6]:
# Fit the multiple regression model
model = sm.OLS(y, X)
results = model.fit()

### Print the Model Results

In [7]:
# Print the regression results
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                  Sales   R-squared:                       0.996
Model:                            OLS   Adj. R-squared:                  0.996
Method:                 Least Squares   F-statistic:                     8385.
Date:                Thu, 29 Jun 2023   Prob (F-statistic):          5.43e-116
Time:                        16:01:20   Log-Likelihood:                -363.46
No. Observations:                 100   AIC:                             734.9
Df Residuals:                      96   BIC:                             745.3
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
const              41.9166      7.641     

### Predict

In [14]:
xtrain = pd.DataFrame(np.array([1.0, 67.71, 1380.99, 480.099]).reshape(1, -1), columns=['const', 'Advertising', 'Website_Traffic', 'Social_Media'])
xtrain

| | const | Advertising | Website_Traffic | Social_Media |
|---|---|---|---|---|
| 0 | 1.0 | 67.71 | 1380.99 | 480.099 |


In [15]:
results.predict(xtrain)

0    1130.93458
dtype: float64

### Check for Multicollinearity

In [16]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [17]:
# Compute the VIF for each column
vif = pd.DataFrame()
vif["Feature"] = X.columns
vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

# Print the VIF results
print(vif)

           Feature        VIF
0            const  66.688774
1      Advertising   1.017686
2  Website_Traffic   1.014977
3     Social_Media   1.008147


## Regression Analysis 2 - Alternative Approach

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [7]:
np.random.seed(0)
advertising = np.random.normal(50, 10, 100)
website_traffic = np.random.normal(1000, 200, 100)
social_media = np.random.normal(500, 100, 100)
sales = 50 + 0.3 * advertising + 0.7 * website_traffic + 0.2 * social_media + np.random.normal(0, 10, 100)

df = pd.DataFrame({
    'Advertising': advertising,
    'Sales': sales,
    'Website_Traffic': website_traffic,
    'Social_Media': social_media,
})

In [8]:
df.head(2)

| | Advertising | Sales | Website_Traffic | Social_Media |
|---|---|---|---|---|
| 0 | 67.640523 | 1113.484349 | 1376.630139 | 463.081816 |
| 1 | 54.001572 | 689.307926 | 730.448188 | 476.062082 |


In [9]:
xtrain = pd.DataFrame(np.array([67.71, 1380.99, 480.099]).reshape(1, -1), columns=['Advertising', 'Website_Traffic', 'Social_Media'])
xtrain

| | Advertising | Website_Traffic | Social_Media |
|---|---|---|---|
| 0 | 67.71 | 1380.99 | 480.099 |


In [10]:
model = ols(formula='Sales ~ Advertising + Website_Traffic + Social_Media', data=df).fit()

In [11]:
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Sales | R-squared: | 0.996 |
| Model: | OLS | Adj. R-squared: | 0.996 |
| Method: | Least Squares | F-statistic: | 8385.0 |
| Date: | Thu, 29 Jun 2023 | Prob (F-statistic): | 5.43e-116 |
| Time: | 16:17:06 | Log-Likelihood: | -363.46 |
| No. Observations: | 100 | AIC: | 734.9 |
| Df Residuals: | 96 | BIC: | 745.3 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 41.9166 | 7.641 | 5.486 | 0.000 | 26.750 | 57.083 |
| Advertising | 0.2277 | 0.094 | 2.431 | 0.017 | 0.042 | 0.414 |
| Website_Traffic | 0.7035 | 0.005 | 154.449 | 0.000 | 0.694 | 0.713 |
| Social_Media | 0.2125 | 0.010 | 21.535 | 0.000 | 0.193 | 0.232 |

| | | | |
|---|---|---|---|
| Omnibus: | 0.37 | Durbin-Watson: | 2.076 |
| Prob(Omnibus): | 0.831 | Jarque-Bera (JB): | 0.537 |
| Skew: | 0.07 | Prob(JB): | 0.765 |
| Kurtosis: | 2.67 | Cond. No. | 9370.0 |


In [13]:
model.params

Intercept          41.916557
Advertising         0.227672
Website_Traffic     0.703536
Social_Media        0.212510
dtype: float64

In [14]:
model.predict(xtrain)

0    1130.93458
dtype: float64

In [15]:
pred = model.get_prediction(xtrain)

In [16]:
pred.summary_frame()

| | mean | mean_se | mean_ci_lower | mean_ci_upper | obs_ci_lower | obs_ci_upper |
|---|---|---|---|---|---|---|
| 0 | 1130.93458 | 2.387269 | 1126.195888 | 1135.673273 | 1111.767484 | 1150.101677 |


In [43]:
X = df[['Advertising', 'Website_Traffic', 'Social_Media']]

In [99]:
temp = pd.DataFrame()

# X has no constant column here, so these VIFs are computed on uncentered data
# and come out much larger than the values in the previous section.
for idx, c in enumerate(X.columns):
    temp.loc[0, c] = variance_inflation_factor(X.values, idx)

temp

| | Advertising | Website_Traffic | Social_Media |
|---|---|---|---|
| 0 | 19.735083 | 18.910171 | 19.160273 |


## Identifying Multicollinearity

In [3]:
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [34]:
np.random.seed(0)
advertising = np.random.normal(50, 10, 100)
website_traffic = np.random.normal(1000, 200, 100)
social_media = np.random.normal(500, 100, 100)
sales = 50 + 0.3 * advertising + 0.7 * website_traffic + 0.2 * social_media + np.random.normal(0, 10, 100)

df = pd.DataFrame({
    'Advertising': advertising,
    'Sales': sales,
    'Website_Traffic': website_traffic,
    'Social_Media': social_media,
})

In [35]:
X = df[['Advertising', 'Website_Traffic', 'Social_Media']]
y = df['Sales']

In [36]:
X = sm.add_constant(X)
X.head(2)

| | const | Advertising | Website_Traffic | Social_Media |
|---|---|---|---|---|
| 0 | 1.0 | 67.640523 | 1376.630139 | 463.081816 |
| 1 | 1.0 | 54.001572 | 730.448188 | 476.062082 |


In [37]:
temp = pd.DataFrame(columns=['Variable', 'VIF'])
temp['Variable'] = X.columns
temp['VIF'] = [variance_inflation_factor(X.values, i) for i in range(len(X.columns))]
temp

| | Variable | VIF |
|---|---|---|
| 0 | const | 66.688774 |
| 1 | Advertising | 1.017686 |
| 2 | Website_Traffic | 1.014977 |
| 3 | Social_Media | 1.008147 |


In [46]:
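# Copy the selection so that adding a column does not trigger a SettingWithCopyWarning
X_2 = df[['Advertising', 'Website_Traffic', 'Social_Media']].copy()

In [47]:
# Any nonzero constant column (here 2.0) acts as the intercept,
# so the VIFs below match those computed with sm.add_constant
X_2['constraint'] = [2.0] * len(X_2)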

In [48]:
temp_2 = pd.DataFrame(columns=['Variable', 'VIF'])
temp_2['Variable'] = X_2.columns
temp_2['VIF'] = [variance_inflation_factor(X_2.values, i) for i in range(len(X_2.columns))]
temp_2

| | Variable | VIF |
|---|---|---|
| 0 | Advertising | 1.017686 |
| 1 | Website_Traffic | 1.014977 |
| 2 | Social_Media | 1.008147 |
| 3 | constraint | 66.688774 |


In [40]:
model = sm.OLS(y, X).fit()

In [41]:
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Sales | R-squared: | 0.996 |
| Model: | OLS | Adj. R-squared: | 0.996 |
| Method: | Least Squares | F-statistic: | 8385.0 |
| Date: | Thu, 29 Jun 2023 | Prob (F-statistic): | 5.43e-116 |
| Time: | 17:10:56 | Log-Likelihood: | -363.46 |
| No. Observations: | 100 | AIC: | 734.9 |
| Df Residuals: | 96 | BIC: | 745.3 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 41.9166 | 7.641 | 5.486 | 0.000 | 26.750 | 57.083 |
| Advertising | 0.2277 | 0.094 | 2.431 | 0.017 | 0.042 | 0.414 |
| Website_Traffic | 0.7035 | 0.005 | 154.449 | 0.000 | 0.694 | 0.713 |
| Social_Media | 0.2125 | 0.010 | 21.535 | 0.000 | 0.193 | 0.232 |

| | | | |
|---|---|---|---|
| Omnibus: | 0.37 | Durbin-Watson: | 2.076 |
| Prob(Omnibus): | 0.831 | Jarque-Bera (JB): | 0.537 |
| Skew: | 0.07 | Prob(JB): | 0.765 |
| Kurtosis: | 2.67 | Cond. No. | 9370.0 |


In [42]:
model_2 = ols(formula='Sales ~ Advertising + Website_Traffic + Social_Media', data=df).fit()

In [43]:
model_2.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Sales | R-squared: | 0.996 |
| Model: | OLS | Adj. R-squared: | 0.996 |
| Method: | Least Squares | F-statistic: | 8385.0 |
| Date: | Thu, 29 Jun 2023 | Prob (F-statistic): | 5.43e-116 |
| Time: | 17:11:30 | Log-Likelihood: | -363.46 |
| No. Observations: | 100 | AIC: | 734.9 |
| Df Residuals: | 96 | BIC: | 745.3 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 41.9166 | 7.641 | 5.486 | 0.000 | 26.750 | 57.083 |
| Advertising | 0.2277 | 0.094 | 2.431 | 0.017 | 0.042 | 0.414 |
| Website_Traffic | 0.7035 | 0.005 | 154.449 | 0.000 | 0.694 | 0.713 |
| Social_Media | 0.2125 | 0.010 | 21.535 | 0.000 | 0.193 | 0.232 |

| | | | |
|---|---|---|---|
| Omnibus: | 0.37 | Durbin-Watson: | 2.076 |
| Prob(Omnibus): | 0.831 | Jarque-Bera (JB): | 0.537 |
| Skew: | 0.07 | Prob(JB): | 0.765 |
| Kurtosis: | 2.67 | Cond. No. | 9370.0 |
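
The imports at the top of this section include `pearsonr`, which the cells above never call. One plausible use, sketched here as an assumption rather than as the author's intended step, is a quick pairwise correlation check between predictors before computing VIFs. The `correlations` dictionary is an illustrative name.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Same synthetic predictors as in the cells above
np.random.seed(0)
predictors = pd.DataFrame({
    'Advertising': np.random.normal(50, 10, 100),
    'Website_Traffic': np.random.normal(1000, 200, 100),
    'Social_Media': np.random.normal(500, 100, 100),
})

# Pearson correlation for every pair of predictors
correlations = {}
for a, b in combinations(predictors.columns, 2):
    r, p_value = pearsonr(predictors[a], predictors[b])
    correlations[(a, b)] = float(r)
    print(f'{a} vs {b}: r={r:.3f}, p={p_value:.3f}')
```

Pairwise correlations only catch relationships between two predictors at a time. VIF also reflects a predictor being explained by a combination of several others, so the two checks complement each other.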
