# Difference-in-differences (DiD)

**Treatment period:**
\begin{equation}
Y_{it} = \beta_0 + \beta_1 D_i + \varepsilon_{it}
\end{equation}

$Y_{it}$ is the price for observation $i$ in time period $t$, 
$D_i$ is a dummy variable that takes the value of 1 during the treatment period (the event week) and 0 otherwise, and 
$\beta_1$ captures the average difference in prices between the treatment period and the non-treatment period

**Treatment city:**
\begin{equation}
Y_{it} = \beta_0 + \beta_2 T_i + \varepsilon_{it}
\end{equation}

$T_i$ is a dummy variable that takes the value of 1 for the treatment city and 0 otherwise, 
$\beta_2$ captures the average difference in prices between the treatment city and the control cities

**Both (final equation):**
\begin{equation}
Y_{it} = \beta_0 + \beta_1 D_i + \beta_2 T_i + \alpha D_i T_i + \varepsilon_{it}
\end{equation}

$D_i$ is the treatment period dummy,
$T_i$ is the treatment city dummy, 
$D_i T_i$ equals 1 only for observations from the treatment city during the treatment period, and
$\alpha$ is the DiD estimate: the change in prices between the two periods in the treatment city minus the corresponding change in the control city


**Why use a second city**

The second city serves as the control group. It is crucial for DiD analysis because it captures shocks and trends over time (for example, seasonality) that affect prices in both cities, while the city dummy absorbs time-invariant differences in price levels between them. The key assumption in a DiD analysis is the parallel trends assumption: in the absence of the treatment, average prices in the treatment and control cities would have followed parallel paths over time. If this assumption is violated, the DiD estimate is biased.


**Why $\alpha$ captures the treatment effect:**

If $\alpha$ is positive and statistically significant, prices in the treatment city rose more between the non-event period and the event period than prices in the control city did. Under the parallel trends assumption, $\alpha$ is the average effect of the event on prices in the treatment city, net of the common change over time and of the permanent price gap between the two cities.
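
To make this concrete, here is a short derivation from the final equation. It assumes $E[\varepsilon_{it} \mid D_i, T_i] = 0$, which is not stated explicitly above. The four group means are:

\begin{align}
E[Y_{it} \mid D_i = 0, T_i = 0] &= \beta_0 \\
E[Y_{it} \mid D_i = 1, T_i = 0] &= \beta_0 + \beta_1 \\
E[Y_{it} \mid D_i = 0, T_i = 1] &= \beta_0 + \beta_2 \\
E[Y_{it} \mid D_i = 1, T_i = 1] &= \beta_0 + \beta_1 + \beta_2 + \alpha
\end{align}

Taking the difference across one dummy within each level of the other, and then the difference of those differences, leaves only the interaction coefficient:

\begin{align}
\big(E[Y_{it} \mid 1, 1] - E[Y_{it} \mid 0, 1]\big) - \big(E[Y_{it} \mid 1, 0] - E[Y_{it} \mid 0, 0]\big) = (\beta_1 + \alpha) - \beta_1 = \alpha
\end{align}

The expression is symmetric in $D_i$ and $T_i$, so the result does not depend on which of the two dummies is read as the period and which as the city.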



In [28]:
import pandas as pd
import statsmodels.api as sm

In [29]:
def openbarcelona(path):
    ''' Function to open 2 dataframes:
    - df1 which is the df of Barcelona the week of the event 
    - df2 which is the df of Barcelona another random no-event week
    '''
    df1 = pd.read_csv(path + '/barcelona_p1.csv')
    df2 = pd.read_csv(path + '/barcelona_p2.csv')
    return df1, df2

In [30]:
df1, df2 = openbarcelona('C:/Users/arimi/Documents/BSE-term2/text-mining/Booking-Scraping/data')

In [31]:
def choosecity(path, city):
    ''' Function to open 2 dataframes:
    - df3 which is the df of the chosen city the week of the event 
    - df4 which is the df of the chosen city another random no-event week
    Note: path must end with a '/' because it is concatenated directly with the city name
    '''
    df3 = pd.read_csv(path + city + '_p1.csv')
    df4 = pd.read_csv(path + city + '_p2.csv')
    return df3, df4

In [32]:
# The city name must be written in lowercase to match the file names
df3, df4 = choosecity('C:/Users/arimi/Documents/BSE-term2/text-mining/Booking-Scraping/data/', 'porto') 

In [33]:
# Remove the euro sign (and any other non-digit character) from the price column,
# then convert it from object to numeric. The raw string avoids an invalid-escape warning for \d.
for df in (df1, df2, df3, df4):
    df['price'] = pd.to_numeric(df['price'].replace(r'[^\d]', '', regex=True))
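
As a quick illustration of what the cleaning step above does, here is a minimal sketch on a few made-up price strings. The raw format of the scraped prices is not shown in this notebook, so the sample values are assumptions.

```python
import pandas as pd

# Hypothetical raw price strings (the real scraped format is an assumption)
raw = pd.Series(["€ 3,491", "€ 1,395", "€ 601"])

# Drop every non-digit character, then convert the result to a numeric dtype
cleaned = pd.to_numeric(raw.str.replace(r"[^\d]", "", regex=True))
print(cleaned.tolist())

# Caveat: a decimal separator is a non-digit too, so it is dropped as well
with_decimals = pd.to_numeric(
    pd.Series(["€ 99.50"]).str.replace(r"[^\d]", "", regex=True)
)
print(with_decimals.tolist())
```

If the scraped prices ever contain decimals, the pattern would need to keep the decimal separator; otherwise the values are inflated by a factor of 100.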

In [34]:
df1.head()

Unnamed: 0.1,Unnamed: 0,place,start_date,end_date,name,price,description_short,rating,url,description
0,0,Barcelona,2024-02-23,2024-03-03,Duquesa Suites Barcelona,3491,,8.8,https://www.booking.com/hotel/es/duquesa-suite...,"Set the centre of Barcelona, 400 metres from P..."
1,1,Barcelona,2024-02-23,2024-03-03,Sonder Casa Luz,4457,,8.4,https://www.booking.com/hotel/es/casa-luz-barc...,"Set in Barcelona, Sonder Casa Luz offers a ter..."
2,2,Barcelona,2024-02-23,2024-03-03,Valencia 2,1395,,,https://www.booking.com/hotel/es/valencia-2.en...,"Located in Barcelona, 1.2 km from Passeig de G..."
3,3,Barcelona,2024-02-23,2024-03-03,Fuster Apartments by Aspasios,2448,Entire apartment • 2 bedrooms • 1 living room ...,9.2,https://www.booking.com/hotel/es/fuster-apartm...,Fuster Apartments are just 150 metres from Dia...
4,4,Barcelona,2024-02-23,2024-03-03,BarcelonaForRent The Central Place,5350,Entire apartment • 1 bedroom • 1 living room •...,8.5,https://www.booking.com/hotel/es/barcelonaforr...,"Offering views of Casa Batlló, BarcelonaForRen..."


In [35]:
df1.describe()

Unnamed: 0.1,Unnamed: 0,price,rating
count,1001.0,1001.0,945.0
mean,500.0,3379.912088,7.852804
std,289.108111,1836.85174,1.127298
min,0.0,601.0,1.0
25%,250.0,2317.0,7.3
50%,500.0,2999.0,8.0
75%,750.0,3903.0,8.5
max,1000.0,17158.0,10.0


In [36]:
df1.shape

(1001, 10)

In [37]:
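# Creation of the dummy variables for the regression
# D is the treatment period dummy (1 = event week, 0 = no-event week)
# T is the treatment city dummy (1 = Barcelona, 0 = the chosen control city)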
df1['D'] = 1
df1['T'] = 1
df2['D'] = 0
df2['T'] = 1
df3['D'] = 1
df3['T'] = 0
df4['D'] = 0
df4['T'] = 0

In [38]:
#Creation of the big data frame
combined_df = pd.concat([df1, df2, df3, df4], ignore_index=True)
combined_df['T*D'] = combined_df['T'] * combined_df['D']
# Display the combined DataFrame
combined_df.head()

Unnamed: 0.1,Unnamed: 0,place,start_date,end_date,name,price,description_short,rating,url,description,D,T,T*D
0,0,Barcelona,2024-02-23,2024-03-03,Duquesa Suites Barcelona,3491,,8.8,https://www.booking.com/hotel/es/duquesa-suite...,"Set the centre of Barcelona, 400 metres from P...",1,1,1
1,1,Barcelona,2024-02-23,2024-03-03,Sonder Casa Luz,4457,,8.4,https://www.booking.com/hotel/es/casa-luz-barc...,"Set in Barcelona, Sonder Casa Luz offers a ter...",1,1,1
2,2,Barcelona,2024-02-23,2024-03-03,Valencia 2,1395,,,https://www.booking.com/hotel/es/valencia-2.en...,"Located in Barcelona, 1.2 km from Passeig de G...",1,1,1
3,3,Barcelona,2024-02-23,2024-03-03,Fuster Apartments by Aspasios,2448,Entire apartment • 2 bedrooms • 1 living room ...,9.2,https://www.booking.com/hotel/es/fuster-apartm...,Fuster Apartments are just 150 metres from Dia...,1,1,1
4,4,Barcelona,2024-02-23,2024-03-03,BarcelonaForRent The Central Place,5350,Entire apartment • 1 bedroom • 1 living room •...,8.5,https://www.booking.com/hotel/es/barcelonaforr...,"Offering views of Casa Batlló, BarcelonaForRen...",1,1,1


In [39]:
combined_df.describe() #Checked if it worked 

Unnamed: 0.1,Unnamed: 0,price,rating,D,T,T*D
count,3979.0,3979.0,3745.0,3979.0,3979.0,3979.0
mean,496.933903,1734.719779,8.252363,0.503141,0.503141,0.251571
std,287.296693,1501.870415,1.006695,0.500053,0.500053,0.43397
min,0.0,253.0,1.0,0.0,0.0,0.0
25%,248.0,708.5,7.8,0.0,0.0,0.0
50%,497.0,1223.0,8.4,1.0,1.0,0.0
75%,745.5,2288.5,8.9,1.0,1.0,1.0
max,1000.0,17158.0,10.0,1.0,1.0,1.0
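
A further check is to compute the DiD estimate directly from the four group means, since with two dummies and their interaction the regression coefficient on the interaction equals this difference of differences. The sketch below uses a small made-up dataset; the helper name `did_from_means` and its parameters are hypothetical, and the column names follow the coding of `D` and `T` used in the dummy-variable cell.

```python
import pandas as pd

def did_from_means(df, outcome="price", period_col="D", city_col="T"):
    """Difference-in-differences computed from the four group means."""
    means = df.groupby([city_col, period_col])[outcome].mean()
    # Change between periods in the treatment city
    change_treated = means.loc[(1, 1)] - means.loc[(1, 0)]
    # Change between periods in the control city
    change_control = means.loc[(0, 1)] - means.loc[(0, 0)]
    return change_treated - change_control

# Made-up toy data, not the scraped prices
toy = pd.DataFrame({
    "price": [100, 110, 120, 130, 200, 220, 300, 320],
    "D":     [0,   0,   1,   1,   0,   0,   1,   1],
    "T":     [0,   0,   0,   0,   1,   1,   1,   1],
})

did = did_from_means(toy)
print(did)
```

On the real data, `did_from_means(combined_df)` should agree with the coefficient on `T*D` from the full regression up to floating-point error.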


In [40]:
# Regressors for the full DiD specification
# (named did_columns rather than `all`, which would shadow the Python builtin)
did_columns = ['T', 'D', 'T*D']

# Function to estimate an OLS regression of price on the given column(s) and return the coefficients
def estimate_regression(df, column_name):
    # add_constant adds the intercept term (beta_0) to the regressors
    model = sm.OLS(df['price'], sm.add_constant(df[column_name])).fit()
    coeff = model.params
    return coeff

# Regression 1: Treatment Period Dummy Only
regression_df1 = estimate_regression(combined_df, 'D')

# Regression 2: Treatment City Dummy Only
regression_df2 = estimate_regression(combined_df, 'T')

# Regression 3: Both Treatment Period and City Dummy with Interaction (DiD)
regression_df3 = estimate_regression(combined_df, did_columns)

# Display the estimated coefficients of each regression
print(regression_df1)
print(regression_df2)
print(regression_df3)


const    1384.692969
D         695.682655
dtype: float64
const     832.890238
T        1792.397475
dtype: float64
const     886.274590
T         984.388746
D        -105.435429
T*D      1614.684181
dtype: float64
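
The output above reports point estimates only, so it cannot say whether the interaction coefficient is statistically significant. The following is a minimal sketch of one way to obtain standard errors and p-values with the statsmodels formula API. The toy data are made up, and the choice of heteroskedasticity-robust (`HC1`) standard errors is an assumption rather than something this notebook specifies.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up toy data, not the scraped prices
toy = pd.DataFrame({
    "price": [100, 110, 120, 130, 200, 220, 300, 320],
    "D":     [0,   0,   1,   1,   0,   0,   1,   1],  # treatment period dummy
    "T":     [0,   0,   0,   0,   1,   1,   1,   1],  # treatment city dummy
})

# "D * T" expands to D + T + D:T, so the interaction term is added automatically
model = smf.ols("price ~ D * T", data=toy).fit(cov_type="HC1")

# Collect coefficients, robust standard errors and p-values in one table
did_table = pd.DataFrame({
    "coef": model.params,
    "std_err": model.bse,
    "p_value": model.pvalues,
})
print(did_table)
```

On the real data the same call would be `smf.ols("price ~ D * T", data=combined_df)`, and the row labelled `D:T` corresponds to the `T*D` coefficient printed above.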
