# Assignment: Exploratory Data Analysis
### `! git clone https://github.com/ds3001f25/eda_assignment.git`
### Do Q1, Q2, and Q3.

**Q1.** In class, we talked about how to compute the sample mean of a variable $X$,
$$
m(X) = \dfrac{1}{N} \sum_{i=1}^N x_i
$$
and sample covariance of two variables $X$ and $Y$,
$$
\text{cov}(X,Y) = \dfrac{1}{N} \sum_{i=1}^N (x_i - m(X))(y_i - m(Y)).
$$
Recall, the sample variance of $X$ is
$$
s^2 = \dfrac{1}{N} \sum_{i=1}^N (x_i - m(X))^2.
$$
It can be very helpful to understand some basic properties of these statistics. If you want to write your calculations on a piece of paper, take a photo, and upload that to your GitHub repo, that's probably easiest.

1. Show that $m(a + bX) = a+b \times m(X)$.
2. Show that $\text{cov}(X,a+bY) = b \times \text{cov}(X,Y)$
3. Show that $\text{cov}(a+bX,a+bX) = b^2 \text{cov}(X,X) $, and in particular that $\text{cov}(X,X) = s^2 $.
4. Instead of the mean, consider the median. Consider transformations that are non-decreasing (if $x\ge x'$, then $g(x)\ge g(x')$), like $2+5 \times X$ or $\text{arcsinh}(X)$. Is a non-decreasing transformation of the median the median of the transformed variable? Explain. Does your answer apply to any quantile? The IQR? The range?
5. Consider a non-decreasing transformation $g()$. Is it always true that $m(g(X))= g(m(X))$?

1. $m(a+bX)=\dfrac{1}{N}\sum_{i=1}^N(a+bx_i)=\dfrac{1}{N}\left(Na+b\sum_{i=1}^N x_i\right)=a+b\left(\dfrac{1}{N}\sum_{i=1}^N x_i\right)=a+b\times m(X)$

---

2. $\text{cov}(X,a+bY)=\dfrac{1}{N}\sum_{i=1}^N(x_i-m(X))((a+by_i)-m(a+bY))
=\dfrac{1}{N}\sum_{i=1}^N(x_i-m(X))(a+by_i-a-b\times m(Y))
=\dfrac{1}{N}\sum_{i=1}^N(x_i-m(X))(by_i-b\times m(Y))
=\dfrac{1}{N}\sum_{i=1}^N(x_i-m(X))\,b\times(y_i-m(Y))
=\dfrac{1}{N}\,b\times\sum_{i=1}^N(x_i-m(X))(y_i-m(Y))
=b\times\left(\dfrac{1}{N}\sum_{i=1}^N(x_i-m(X))(y_i-m(Y))\right)
=b\times\text{cov}(X,Y)$

---

3. $\text{cov}(a+bX,a+bX)=\dfrac{1}{N}\sum_{i=1}^N((a+bx_i)-m(a+bX))((a+bx_i)-m(a+bX))
=\dfrac{1}{N}\sum_{i=1}^N(a+bx_i-a-b\times m(X))(a+bx_i-a-b\times m(X))
=\dfrac{1}{N}\sum_{i=1}^N(bx_i-b\times m(X))(bx_i-b\times m(X))
=\dfrac{1}{N}\sum_{i=1}^N b\times(x_i-m(X))\,b\times(x_i-m(X))
=\dfrac{1}{N}\,b^2\times\sum_{i=1}^N(x_i-m(X))(x_i-m(X))
=b^2\times\left(\dfrac{1}{N}\sum_{i=1}^N(x_i-m(X))(x_i-m(X))\right)
=b^2\times\text{cov}(X,X)$

Also, $\text{cov}(X,X)=\dfrac{1}{N}\sum_{i=1}^N(x_i-m(X))(x_i-m(X))= \dfrac{1}{N} \sum_{i=1}^N (x_i - m(X))^2=s^2$
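
As a sanity check on parts 1 to 3, the identities can be verified numerically. The sketch below is only an illustration: the sample values and the constants `a` and `b` are made up, and `cov` is a small helper written here with the same $1/N$ normalization as the formulas above.

```python
from statistics import fmean

def cov(xs, ys):
    """Sample covariance with the 1/N normalization used in this assignment."""
    mx, my = fmean(xs), fmean(ys)
    return fmean([(xi - mx) * (yi - my) for xi, yi in zip(xs, ys)])

# Hypothetical sample values and constants, chosen only for illustration.
x = [1.0, 4.0, 2.0, 8.0, 5.0]
y = [3.0, 1.0, 7.0, 2.0, 6.0]
a, b = 2.0, 5.0

ax = [a + b * xi for xi in x]  # the transformed variable a + bX
ay = [a + b * yi for yi in y]  # the transformed variable a + bY

# Each pair holds (left-hand side, right-hand side) of one identity.
part1 = (fmean(ax), a + b * fmean(x))
part2 = (cov(x, ay), b * cov(x, y))
part3 = (cov(ax, ax), b**2 * cov(x, x))

for name, (lhs, rhs) in [("part 1", part1), ("part 2", part2), ("part 3", part3)]:
    print(name, lhs, rhs)
```

A numerical match on one sample is not a proof, but a mismatch would point to an algebra mistake.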

---

4. Yes, with one caveat. Let $\tilde{X}$ be the median of $X$. A non-decreasing $g$ preserves the ordering of the data: if $x_i \le x_j$, then $g(x_i) \le g(x_j)$. The observation in the middle of the sorted $x$ values is therefore still in the middle of the sorted $g(x)$ values, so the median of $g(X)$ is $g(\tilde{X})$. The caveat: when $N$ is even and the median is defined as the average of the two middle values, the equality is exact only when $g$ is linear, because averaging and a nonlinear $g$ do not commute.

The same argument applies to any quantile that is defined as an order statistic. Suppose $Q_{1,x}, Q_{2,x}, Q_{3,x}, Q_{4,x}$ are the quartiles of $X$. Because $g$ keeps the observations in the same order, the quartiles of $g(X)$ are $g(Q_{1,x})$, $g(Q_{2,x})$, $g(Q_{3,x})$, and $g(Q_{4,x})$.

The IQR does not carry over in the same way. The original IQR is $Q_{3,x}- Q_{1,x}$ and the IQR of the transformed variable is $g(Q_{3,x}) - g(Q_{1,x})$. Since $g$ is non-decreasing and $Q_{3,x} \ge Q_{1,x}$, we know $g(Q_{3,x}) \ge g(Q_{1,x})$, so the transformed IQR is still non-negative. However, it is generally not equal to $g(Q_{3,x}- Q_{1,x})$, because $g$ of a difference is not the difference of the $g$ values. Even for $g(x)=2+5x$, the IQR of the transformed variable is $5 \times \text{IQR}$, while $g(\text{IQR}) = 2 + 5 \times \text{IQR}$.

The same holds for the range. The original range is $\max(X) - \min(X)$ and the transformed range is $g(\max(X)) - g(\min(X))$, which is generally not $g(\max(X) - \min(X))$. The maximum and minimum themselves do transform like quantiles, but their difference does not.
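
A small numerical illustration of part 4, using only the standard library. Everything here is an assumption for demonstration purposes: the sample values are made up, `order_quantile` is a helper defined in this sketch that takes a quantile to be an order statistic (no interpolation), and $g$ is taken to be $\text{arcsinh}$.

```python
import math
from statistics import median

def order_quantile(xs, q):
    """Quantile defined as an order statistic: the ceil(q*N)-th smallest value."""
    s = sorted(xs)
    k = max(math.ceil(q * len(s)), 1)
    return s[k - 1]

g = math.asinh  # a non-decreasing transformation

# Hypothetical sample with an odd number of points, for illustration only.
x = [1.0, 3.0, 4.0, 7.0, 10.0, 12.0, 20.0]
gx = [g(v) for v in x]

# Median and quartiles: transforming first or last gives the same answer.
median_commutes = median(gx) == g(median(x))
quartiles_commute = all(
    order_quantile(gx, q) == g(order_quantile(x, q)) for q in (0.25, 0.5, 0.75)
)

# IQR and range: differences of quantiles do not commute with g.
iqr_x = order_quantile(x, 0.75) - order_quantile(x, 0.25)
iqr_gx = order_quantile(gx, 0.75) - order_quantile(gx, 0.25)
range_x = max(x) - min(x)
range_gx = max(gx) - min(gx)

# Even N, median as the average of the two middle values: equality can fail.
x_even = [1.0, 2.0, 3.0, 10.0]
even_gap = median([g(v) for v in x_even]) - g(median(x_even))

print("median commutes:", median_commutes)
print("quartiles commute:", quartiles_commute)
print("IQR of g(X) vs g(IQR of X):", iqr_gx, g(iqr_x))
print("range of g(X) vs g(range of X):", range_gx, g(range_x))
print("even-N median gap:", even_gap)
```

Quantile functions that interpolate between observations (the default in many libraries) behave like the even-N median case, so exact equality should only be expected for order-statistic definitions.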

---

5. No, not always. $m(g(X))=\dfrac{1}{N}\sum_{i=1}^N g(x_i)$ averages the transformed values, while $g(m(X))$ transforms the average. The two agree when $g$ is linear (part 1), but in general they differ, and without knowing the shape of $g$ we cannot even say which one is larger. For example, with $X = \{0, 2\}$ and $g(x) = x^3$, $m(g(X)) = 4$ but $g(m(X)) = 1$.
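
The comparison in part 5 can also be checked on a tiny made-up sample. The sketch below compares $m(g(X))$ with $g(m(X))$ for three non-decreasing transformations; the sample and the choice of transformations are assumptions for illustration. The signs of the gaps are consistent with Jensen's inequality, since $x^3$ is convex and $\text{arcsinh}$ is concave on non-negative values.

```python
import math
from statistics import fmean

# Hypothetical non-negative sample, for illustration only.
x = [0.0, 2.0, 4.0]

transforms = {
    "linear 2+5x": lambda v: 2 + 5 * v,
    "cube": lambda v: v**3,
    "arcsinh": math.asinh,
}

# gap = m(g(X)) - g(m(X)); zero means the two orders of operation agree.
gaps = {
    name: fmean([g(v) for v in x]) - g(fmean(x))
    for name, g in transforms.items()
}

for name, gap in gaps.items():
    print(name, gap)
```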

---

**Q2.** This question uses the Airbnb data to practice making visualizations.

  1. Load the `./data/airbnb_hw.csv` data with Pandas. This provides a dataset of AirBnB rental properties for New York City.  

In [None]:
import pandas as pd
airbnb = pd.read_csv("https://raw.githubusercontent.com/xec9cp/eda_assignment/refs/heads/main/data/airbnb_hw.csv")

2. What are the dimensions of the data? How many observations are there? What are the variables included? Use `.head()` to examine the first few rows of data.

In [8]:
print(airbnb.shape)    # (rows, columns): 30478 observations and 13 variables
print(airbnb.columns)  # names of the variables included
airbnb.head()

(30478, 13)
Index(['Host Id', 'Host Since', 'Name', 'Neighbourhood ', 'Property Type',
       'Review Scores Rating (bin)', 'Room Type', 'Zipcode', 'Beds',
       'Number of Records', 'Number Of Reviews', 'Price',
       'Review Scores Rating'],
      dtype='object')


| | Host Id | Host Since | Name | Neighbourhood | Property Type | Review Scores Rating (bin) | Room Type | Zipcode | Beds | Number of Records | Number Of Reviews | Price | Review Scores Rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5162530 | | 1 Bedroom in Prime Williamsburg | Brooklyn | Apartment | | Entire home/apt | 11249.0 | 1.0 | 1 | 0 | 145 | |
| 1 | 33134899 | | Sunny, Private room in Bushwick | Brooklyn | Apartment | | Private room | 11206.0 | 1.0 | 1 | 1 | 37 | |
| 2 | 39608626 | | Sunny Room in Harlem | Manhattan | Apartment | | Private room | 10032.0 | 1.0 | 1 | 1 | 28 | |
| 3 | 500 | 6/26/2008 | Gorgeous 1 BR with Private Balcony | Manhattan | Apartment | | Entire home/apt | 10024.0 | 3.0 | 1 | 0 | 199 | |
| 4 | 500 | 6/26/2008 | Trendy Times Square Loft | Manhattan | Apartment | 95.0 | Private room | 10036.0 | 3.0 | 1 | 39 | 549 | 96.0 |


3. Cross tabulate `Room Type` and `Property Type`. What patterns do you see in what kinds of rentals are available? For which kinds of properties are private rooms more common than renting the entire property?

In [9]:
pd.crosstab(airbnb['Room Type'], airbnb['Property Type'])

| Room Type | Apartment | Bed & Breakfast | Boat | Bungalow | Cabin | Camper/RV | Castle | Chalet | Condominium | Dorm | House | Hut | Lighthouse | Loft | Other | Tent | Townhouse | Treehouse | Villa |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Entire home/apt | 15669 | 13 | 7 | 4 | 1 | 6 | 0 | 0 | 72 | 4 | 752 | 0 | 1 | 392 | 14 | 0 | 83 | 0 | 4 |
| Private room | 10748 | 155 | 1 | 0 | 1 | 1 | 1 | 1 | 22 | 16 | 1258 | 2 | 0 | 312 | 29 | 4 | 52 | 1 | 4 |
| Shared room | 685 | 12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 | 80 | 0 | 0 | 49 | 4 | 0 | 1 | 3 | 0 |


- Apartment is by far the most common property type, and among apartments, Entire home/apt is the most common room type. Shared rooms are rare across all property types.
- Private rooms outnumber entire-property rentals for Bed & Breakfast, Castle, Chalet, Dorm, House, Hut, Other, Tent, and Treehouse. Of these, only House, Bed & Breakfast, Other, and Dorm have more than a handful of listings.
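
The comparison between private rooms and entire-property rentals can be read off the table programmatically rather than by eye. The sketch below re-enters the counts from the cross tabulation output above into a small DataFrame (so it runs without the CSV); the variable names are choices made for this sketch.

```python
import pandas as pd

property_types = [
    "Apartment", "Bed & Breakfast", "Boat", "Bungalow", "Cabin", "Camper/RV",
    "Castle", "Chalet", "Condominium", "Dorm", "House", "Hut", "Lighthouse",
    "Loft", "Other", "Tent", "Townhouse", "Treehouse", "Villa",
]

# Counts copied from the crosstab output above.
counts = pd.DataFrame(
    [
        [15669, 13, 7, 4, 1, 6, 0, 0, 72, 4, 752, 0, 1, 392, 14, 0, 83, 0, 4],
        [10748, 155, 1, 0, 1, 1, 1, 1, 22, 16, 1258, 2, 0, 312, 29, 4, 52, 1, 4],
        [685, 12, 0, 0, 0, 0, 0, 0, 0, 11, 80, 0, 0, 49, 4, 0, 1, 3, 0],
    ],
    index=["Entire home/apt", "Private room", "Shared room"],
    columns=property_types,
)

# Property types where private rooms strictly outnumber entire-property rentals.
private_gt_entire = counts.loc["Private room"] > counts.loc["Entire home/apt"]
more_private = list(private_gt_entire[private_gt_entire].index)

# Share of each room type within each property type (columns sum to 1).
shares = counts / counts.sum(axis=0)

print(more_private)
print(shares.round(2))
```

With the full dataset loaded, `pd.crosstab(airbnb['Room Type'], airbnb['Property Type'], normalize='columns')` should give the same column shares directly.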

----

4. For `Price`, make a histogram, kernel density, box plot, and a statistical description of the variable. Are the data badly scaled? Are there many outliers? Use `log` to transform price into a new variable, `price_log`, and take these steps again.
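
One possible way to approach the numeric part of this step is sketched below on a toy column. Several things are assumptions: the first five prices are taken from the `.head()` output above, the remaining three are made up, and the prices are stored as strings with a thousands separator to show the cleaning step that would be needed if Pandas reads `Price` as text. The 1.5 × IQR rule in `iqr_outlier_count` is one common convention for flagging outliers, not the only one.

```python
import numpy as np
import pandas as pd

# Toy stand-in for airbnb['Price']: five values from .head(), three hypothetical.
toy = pd.DataFrame({"Price": ["145", "37", "28", "199", "549", "1,990", "75", "120"]})

# Strip thousands separators and coerce to numbers (invalid entries become NaN).
toy["Price"] = pd.to_numeric(
    toy["Price"].astype(str).str.replace(",", "", regex=False), errors="coerce"
)
toy["price_log"] = np.log(toy["Price"])

def iqr_outlier_count(s):
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], the usual box plot whiskers."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return int(((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum())

raw_outliers = iqr_outlier_count(toy["Price"])
log_outliers = iqr_outlier_count(toy["price_log"])

print(toy[["Price", "price_log"]].describe())
print("outliers (raw, log):", raw_outliers, log_outliers)
```

For the plots themselves, `Series.plot.hist()`, `Series.plot.kde()`, and `Series.plot.box()` are available on a Pandas Series when matplotlib is installed (the kernel density also needs SciPy); applying them to both `Price` and `price_log` gives the before and after comparison the question asks for.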

5. Make a scatterplot of `price_log` and `Beds`. Describe what you see. Use `.groupby()` to compute a description of `Price` conditional on/grouped by the number of beds. Describe any patterns you see in the average price and standard deviation in prices.

6. Make a scatterplot of `price_log` and `Beds`, but color the graph by `Room Type` and `Property Type`. What patterns do you see? Compute a description of `Price` conditional on `Room Type` and `Property Type`. Which Room Type and Property Type have the highest prices on average? Which have the highest standard deviation? Does the mean or median appear to be a more reliable estimate of central tendency, and explain why?
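
The grouped descriptions in steps 5 and 6 could be computed along the following lines. This is a sketch on the five rows shown in the `.head()` output above, not on the full dataset, so the numbers only illustrate the mechanics. It assumes `Price` is already numeric.

```python
import numpy as np
import pandas as pd

# The five rows from the .head() output above (only the columns needed here).
rows = pd.DataFrame({
    "Property Type": ["Apartment"] * 5,
    "Room Type": ["Entire home/apt", "Private room", "Private room",
                  "Entire home/apt", "Private room"],
    "Beds": [1.0, 1.0, 1.0, 3.0, 3.0],
    "Price": [145, 37, 28, 199, 549],
})
rows["price_log"] = np.log(rows["Price"])

# Step 5: describe Price conditional on the number of beds.
by_beds = rows.groupby("Beds")["Price"].describe()

# Step 6: describe Price conditional on Room Type and Property Type.
by_type = rows.groupby(["Room Type", "Property Type"])["Price"].agg(
    ["count", "mean", "median", "std"]
)

print(by_beds)
print(by_type)
```

Even in these five rows, a single expensive private room pulls the group mean far above the group median, which is the kind of pattern to look for when judging which measure of central tendency is more reliable. For the colored scatterplot, `seaborn.scatterplot(data=airbnb, x='Beds', y='price_log', hue='Room Type', style='Property Type')` is one option, assuming seaborn is installed and `price_log` has been added to `airbnb`.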


---

**Q3.** This question looks at a time series of the number of active oil drilling rigs in the United States over time. The data comes from the Energy Information Administration (EIA).

1. Load `./data/drilling_rigs.csv` and examine the data. How many observations? How many variables? Are numeric variables correctly read in by Pandas, or will some variables have to be typecast/coerced? Explain clearly how these data need to be cleaned.
2. To convert the `Month` variable to an ordered datetime variable, use `df['time'] = pd.to_datetime(df['Month'], format='mixed')`.
3. Let's look at `Active Well Service Rig Count (Number of Rigs)`, which is the total number of rigs over time. Make a line plot of this time series. Describe what you see.
4. Instead of levels, we want to look at change over time. Compute the first difference of  `Active Well Service Rig Count (Number of Rigs)` and plot it over time. Describe what you see.
5. The first two columns are the number of onshore and offshore rigs, respectively. Melt these columns and plot the resulting series.
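
The steps above can be sketched on a toy frame as follows. Nearly everything in the data is an assumption: the numbers are made up, `Onshore` and `Offshore` are placeholder names for the first two columns, the `"Not Available"` string stands in for whatever non-numeric marker the real file uses, and the `Month` strings are assumed to look like `1973 January`. Because the toy strings are uniform, the sketch passes an explicit format instead of `format='mixed'`.

```python
import pandas as pd

rig_col = "Active Well Service Rig Count (Number of Rigs)"

# Hypothetical rows for illustration; not taken from drilling_rigs.csv.
df = pd.DataFrame({
    "Month": ["1973 January", "1973 February", "1973 March", "1973 April"],
    "Onshore": [1120, 1037, 959, 920],
    "Offshore": [99, 89, 90, 73],
    rig_col: ["Not Available", "1549", "1677", "1805"],
})

# Step 1: coerce text columns to numbers; non-numeric markers become NaN.
df[rig_col] = pd.to_numeric(df[rig_col], errors="coerce")

# Step 2: build an ordered datetime variable and sort by it.
df["time"] = pd.to_datetime(df["Month"], format="%Y %B")
df = df.sort_values("time").reset_index(drop=True)

# Step 4: first difference (change from the previous month).
df["rig_diff"] = df[rig_col].diff()

# Step 5: melt the two location columns into long format for plotting.
long = df.melt(
    id_vars="time",
    value_vars=["Onshore", "Offshore"],
    var_name="location",
    value_name="rigs",
)

print(df[["time", rig_col, "rig_diff"]])
print(long)
```

From here, `df.plot(x='time', y=rig_col)` and `df.plot(x='time', y='rig_diff')` would give the line plots for steps 3 and 4, and the long frame can be plotted with one line per `location`.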