# EDA Project: King County Housing Data

![](https://kingcounty.gov/~/media/services/home-property/historic-preservation/images/KirklandHistoricHome.jpeg)

Source: [kingcounty.gov](https://kingcounty.gov/services/home-property/historic-preservation/projects/kirkland-inventory.aspx)

**Authored by: Su Wong**


### Stakeholder: Zachary Brooks, Seller

Stakeholder profile: Invests in historical houses, best neighborhoods, high profits, best timing within a year, should renovate?

### Introduction

The goal in this project was to come up with a few potential homes that fit the requirements of Zachary Brooks. He invests in historical homes in the best neighborhoods, seeking to make high profits within a time frame of a year. He wants to know if the houses he purchases should be renovated to increase his final resale price.

### Assumptions

We assume that historical homes are at least 50 years old.

### Questions:
- How do we define the best neighborhoods? High house prices, large houses, large lots or possibly houses that are similar in size to their neighbors?
- Best timing within a year: How can he increase his profits within a year? Should he renovate the houses that he buys? What kind of renovations can be achieved within a year?
- How are condition, grade, sqft, bedrooms, bathrooms, total floors, location, etc. related to the final sale price of the house?
- Which factors from this list can we improve within a year to increase profits?
- How much does the price increase when the condition and grade are improved?

### Hypothesis:

- The better the condition of the house, the higher the price
- The better the grade of the house, the higher the price
- The bigger the house, and the closer its sqft_living is to sqft_living15, the higher the price
- The better the neighborhood the house is in, the higher the price

### Column Names and descriptions for the King County Data Set

- **id** - unique identifier for a house
- **date** - Date house was sold
- **price** - Price is prediction target
- **bedrooms** - Number of bedrooms
- **bathrooms** - Number of bathrooms
- **sqft_living** - square footage of the home
- **sqft_lot** - square footage of the lot
- **floors** - Total floors (levels) in house
- **waterfront** - House which has a view to a waterfront
- **view** - Has been viewed
- **condition** - How good the condition is ( Overall )
- **grade** - overall grade given to the housing unit, based on King County grading system
- **sqft_above** - square footage of house apart from basement
- **sqft_basement** - square footage of the basement
- **yr_built** - Built Year
- **yr_renovated** - Year when house was renovated
- **zipcode** - zip
- **lat** - Latitude coordinate
- **long** - Longitude coordinate
- **sqft_living15** - The square footage of interior housing living space for the nearest 15 neighbors
- **sqft_lot15** - The square footage of the land lots of the nearest 15 neighbors

### Importing packages

In [None]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

from matplotlib.ticker import PercentFormatter
plt.rcParams.update({ "figure.figsize" : (8, 5),"axes.facecolor" : "white", "axes.edgecolor":  "black"})
plt.rcParams["figure.facecolor"]= "w"
pd.plotting.register_matplotlib_converters()
pd.set_option('display.float_format', lambda x: '%.3f' % x)

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Understanding the data

### Reading the data

In [None]:
# Read the data
df = pd.read_csv('data/King_County_House_prices_dataset.csv')

### Checking the dataset

In [None]:
# View the first 5 rows of the dataset
df.head()

In [None]:
# Getting the shape of the DataFrame (there are 21597 entries in the dataset)
df.shape

In [None]:
# Column names
df.columns

In [None]:
# Check for null values
df.info()

In [None]:
# Look at some descriptive statistics of the data:
df.describe()

### Dropping data

Some of the data may not be useful at all. id appears to be a unique identifying number for each sale record, so it says nothing about the house itself.
It also appears that sqft_living = sqft_above + sqft_basement, which makes sqft_above redundant. The id and sqft_above columns are dropped.

In [None]:
# Drop the id column
df.drop('id', axis=1, inplace=True)

# Drop the sqft_above column
df.drop('sqft_above', axis=1, inplace=True)
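
The redundancy claim can be checked before anything is dropped. The following is only a sketch on a small made-up frame (the `toy` name and its values are hypothetical). It assumes, as in the raw file, that sqft_basement is stored as strings with '?' placeholders, which are coerced to NaN and skipped in the comparison:

```python
import pandas as pd

# Hypothetical toy rows; sqft_basement is stored as strings, as in the raw file
toy = pd.DataFrame({
    "sqft_living":   [1180, 2570, 1960, 1680],
    "sqft_above":    [1180, 2170, 1050, 1680],
    "sqft_basement": ["0.0", "400.0", "910.0", "?"],
})

# Coerce the string column to numbers; '?' becomes NaN
basement = pd.to_numeric(toy["sqft_basement"], errors="coerce")

# Compare only the rows where the basement size is known
known = basement.notna()
identity_holds = (
    toy.loc[known, "sqft_above"] + basement[known] == toy.loc[known, "sqft_living"]
).all()
print(identity_holds)
```

If the check fails on the real data, the offending rows are worth inspecting before sqft_above is discarded.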

There also seems to be an extreme outlier in the number of bedrooms. A house with 33 bedrooms but only 1.75 bathrooms does not seem to be realistic. There may have been a mistake during the entry of the data, so we drop the row with this index number.

In [None]:
df.sort_values('bedrooms', ascending=False).head()
if 15856 in df.index:
    df.drop(15856, axis=0, inplace=True)
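
Dropping by a hard-coded index works only as long as the row order of the file never changes. As an alternative sketch, implausible records could be flagged with a rule instead. The threshold of 10 bedrooms per bathroom is an assumption made for this illustration, and the `toy` frame is made up:

```python
import pandas as pd

# Hypothetical toy rows; the last one mimics the 33-bedroom record
toy = pd.DataFrame(
    {"bedrooms": [3, 4, 2, 33], "bathrooms": [1.0, 2.5, 1.0, 1.75]},
    index=[10, 11, 12, 13],
)

# Assumed rule of thumb: more than 10 bedrooms per bathroom is implausible
MAX_BEDROOMS_PER_BATHROOM = 10
ratio = toy["bedrooms"] / toy["bathrooms"]
suspect = toy[ratio > MAX_BEDROOMS_PER_BATHROOM]

# Drop whatever the rule flagged, without naming an index explicitly
cleaned = toy.drop(index=suspect.index)
print(suspect.index.tolist())
```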

### Adding columns

The date column is converted to a datetime object. A month and year column is also added for future analysis of seasonal price trends.

In [None]:
df.date = pd.to_datetime(df.date)       # Convert date column to datetime object
df['year'] = df.date.dt.year            # Add year column
df['month'] = df.date.dt.strftime('%b') # Add month column
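
Because strftime('%b') produces plain strings, month names sort alphabetically (Apr, Aug, Dec, ...) rather than chronologically. One possible way to keep calendar order for later seasonal plots is an ordered categorical. The sketch below uses a made-up three-row frame; `toy` and `month_order` are illustrative names, not part of the notebook, and English month abbreviations are assumed:

```python
import pandas as pd

# Hypothetical sale dates
toy = pd.DataFrame({"date": pd.to_datetime(["2014-10-13", "2015-02-25", "2014-05-02"])})
toy["month"] = toy["date"].dt.strftime("%b")

# Calendar order, spelled out so it does not depend on locale lookups
month_order = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
toy["month"] = pd.Categorical(toy["month"], categories=month_order, ordered=True)

# Sorting now follows the calendar instead of the alphabet
sorted_months = toy.sort_values("month")["month"].tolist()
print(sorted_months)
```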

A hypothesis I have is that houses with a sqft_living similar to that of their neighbors are in better neighborhoods, because the houses there are more uniform. Houses in better neighborhoods should, in turn, sell for higher prices. I want to investigate the correlation between house size uniformity and price. A column was added that holds the absolute difference between a house's sqft_living and the sqft_living15 of its nearest 15 neighbors:

In [None]:
df.eval('diff_sqft_living = abs(sqft_living - sqft_living15)', inplace=True)

### Dealing with null values

The total number of null values in each column is investigated:

In [None]:
df.isna().sum()

There are null values in the waterfront, view and yr_renovated columns. Waterfront indicates whether a house has a waterfront lot, with 1 meaning "yes" and 0 "no". A waterfront is typically a selling point of a house, so it would be advertised if a house had one. I think it is safe to assume that the missing values in the waterfront column can be replaced by 0. The waterfront values are also converted into a boolean. It is possible to convert the waterfront column into a categorical datatype using the pd.get_dummies() function, but it is not clear at this point how important this variable is to the requirements set out by our stakeholder.

In [None]:
df["waterfront"] = df["waterfront"].fillna(0)   # Assign back instead of filling in place on a column selection
df.waterfront = df.waterfront.astype(bool)

As for the view column, its meaning is still up for debate. There are a couple different interpretations of this column. The first one is that view represents a rating of how good the view from the house is. Another interpretation is that it represents how many times a house was viewed before it was sold. The NaN values are replaced with zeros and converted to int datatypes.

In [None]:
df["view"] = df["view"].fillna(0)   # Assign back instead of filling in place on a column selection
df.view = df.view.astype(int)

For yr_renovated, we assume that a null value means the house has not been renovated, so we fill it with 0 and convert the type to int64.

In [None]:
df.yr_renovated.unique()
df["yr_renovated"] = df["yr_renovated"].fillna(0)   # Assign back instead of filling in place on a column selection
df.yr_renovated = df.yr_renovated.astype(int)

sqft_basement is a string object, which is not expected. Examining the unique values of sqft_basement, some '?' are found in the data. The '?'s are converted to '0's and then subsequently to int datatypes.

In [None]:
df.sqft_basement.unique()
df.sqft_basement = df.sqft_basement.replace('?', '0')   # Replace '?' in column with '0'
df.sqft_basement = df.sqft_basement.astype(float)   # Convert string to float
df.sqft_basement = df.sqft_basement.astype(int)     # Convert float to int

Some additional checks were made:

In [None]:
# Check for unique values 
unique_vals_list = []
for col in df.columns:
    unique_vals_list.append({'column': col, 'unique values': len(df[col].unique())})
pd.DataFrame(unique_vals_list)

# Check to see if there are any duplicate rows
duplicate_rows = df.duplicated()
duplicate_rows.sum()

The dataset should now have no null values left:

In [None]:
df.isna().sum()
df.info()

## Data Analysis and Exploration

Finally, we can explore the data! We plot the histogram for all columns to examine the distribution of all the values in these columns:

In [None]:
df.hist(figsize=(16, 20), bins=50, xlabelsize=8, ylabelsize=8);

A histogram of the price column is examined in more detail below. It appears that there are outliers which heavily skew the data.

In [None]:
sns.histplot(data = df,x = 'price', bins=100)
plt.xlabel('Price ($)', fontsize=16)
plt.ylabel('Count', fontsize=16)
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.show();

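We now look at the distribution of house prices using a box plot. Based on the seaborn.boxplot docs, whis=1.5 by default, which means that the whisker limits are at Q1 - 1.5*IQR and at Q3 + 1.5*IQR. The lower whisker in the plot below does not reach Q1 - 1.5*IQR, however, because each whisker is only drawn out to the most extreme data point that lies within its limit, and the computed lower limit can fall below the cheapest house in the data.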

In [None]:
q1 = df["price"].quantile(0.25)
q3 =df["price"].quantile(0.75)
iqr = q3 - q1
limit_lower = q1 - 1.5*iqr
limit_upper = q3 + 1.5*iqr

fig, ax=plt.subplots(figsize=(10,3))
ax = sns.boxplot(x=df.price)
plt.scatter(limit_lower,0, color='red', marker='x', s=150)
plt.scatter(limit_upper,0, color='red', marker='x', s=150)
plt.show();
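
To illustrate why a whisker can stop short of its limit, here is a small sketch on made-up prices (the `prices` series is hypothetical). It computes both the limits and the positions where the whiskers would actually end, that is, the most extreme observations that still lie inside the limits:

```python
import pandas as pd

# Hypothetical right-skewed prices, in dollars
prices = pd.Series([100_000, 200_000, 300_000, 400_000, 500_000, 2_000_000])

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
limit_lower = q1 - 1.5 * iqr
limit_upper = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the limits
whisker_low = prices[prices >= limit_lower].min()
whisker_high = prices[prices <= limit_upper].max()

print(limit_lower, whisker_low)
print(limit_upper, whisker_high)
```

In this toy example the lower limit is negative, so the lower whisker ends at the cheapest price, and the single expensive house lies beyond the upper limit and would be drawn as an outlier.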

The outlier values, however, are all on the upper end of the prices, so limiting the dataset to prices below Q3 + 1.5*IQR is reasonable. For our client, I am more interested in affordable homes that leave room for more profit.

In [None]:
# Plotting a histogram of prices below Q3+1.5IQR
df_lower = df[df.price < limit_upper]
fig, ax=plt.subplots(figsize=(10,6))
sns.histplot(data = df_lower, x='price', hue='grade', bins=100, palette="bright", alpha=0.5)
plt.show();

Counting the houses per grade and per condition for house prices below Q3+1.5*IQR, the most common grade is 7 and the most common condition is 3.

In [None]:
df_lower.value_counts('grade').reset_index(name='count').sort_values('grade', ascending=True).reset_index(drop=True)
df_lower.value_counts('condition').reset_index(name='count').sort_values('condition', ascending=True).reset_index(drop=True)

Let's also look at the distribution of the house prices above Q3+1.5IQR.

In [None]:
df_upper = df[df.price >= limit_upper]
fig, ax=plt.subplots(figsize=(10,6))
sns.histplot(data = df_upper, x = 'price', hue='grade', bins=50, palette="bright", alpha=0.5)
plt.show();

Doing the same for the house prices above Q3+1.5*IQR, the most common grade in the outlier data is 10 and the most common condition is 3.

In [None]:
df_upper.value_counts('grade').reset_index(name='count').sort_values('grade', ascending=True).reset_index(drop=True)
df_upper.value_counts('condition').reset_index(name='count').sort_values('condition', ascending=True).reset_index(drop=True)

Alternatively, I also tried keeping only the house prices below the 90th percentile. It is a simpler approach than having to calculate Q3 + 1.5*IQR. Since we will probably only work with house prices around the median value or less, this cutoff is justified.

In [None]:
df_90th = df[df.price < df['price'].quantile(0.90)]
sns.histplot(data=df_90th, x="price", kde=True)
plt.show();

I also took a look at the distribution of house prices for renovated vs unrenovated houses. The mean price for renovated houses was higher than for unrenovated houses.

In [None]:
# Distribution in prices for renovated vs not renovated
df_norenov = df_90th[df_90th['yr_renovated'] == 0]
df_renov = df_90th[df_90th['yr_renovated'] != 0]
print('Unrenovated mean price: ' + str(df_norenov['price'].mean()))
print('Renovated mean price: ' + str(df_renov['price'].mean()))

sns.histplot(data=df_norenov, x="price", kde=True, color="red")
sns.histplot(data=df_renov, x="price", kde=True, color="blue")
plt.show();
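
The same comparison can be written as a single grouped summary, which also shows the group sizes and the medians next to the means. This is a sketch on made-up rows; the `status` column and the `toy` frame are illustrative names only:

```python
import pandas as pd

# Hypothetical prices; yr_renovated == 0 means "never renovated"
toy = pd.DataFrame({
    "price":        [300_000, 450_000, 500_000, 350_000, 700_000],
    "yr_renovated": [0, 1991, 0, 0, 2005],
})

# Label each row, then summarise price per label in one call
toy["status"] = toy["yr_renovated"].ne(0).map({True: "renovated", False: "unrenovated"})
summary = toy.groupby("status")["price"].agg(["count", "mean", "median"])
print(summary)
```

A higher mean for renovated houses does not by itself show that renovation causes the difference, since renovated houses may also differ in size, grade or location.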

## Testing hypotheses

1. The higher the condition, the higher the price
2. The higher the grade, the higher the price
3. Houses of similar size to their neighbors have higher prices

I am interested to know how condition and grade increase the house price. First, the prices from the whole data set are plotted as a scatter plot against condition and grade. It can be observed that there are large variations in the price for the same condition and grade. The goal is to investigate for the client how we can sell his houses for prices that are on the upper end of the scale, given a fixed condition or grade.

In [None]:
fig, ax=plt.subplots()
sns.set_theme(style="white")
sns.scatterplot(data = df, x = "condition", y = "price", hue = "condition", s=150, palette="colorblind")
plt.title('House prices by condition', fontsize=16)
plt.xlabel('Condition', fontsize=16)
plt.ylabel('Price ($)', fontsize=16)
plt.xticks(range(1,6), fontsize=16)
ax.set_yticklabels(['{:,.0f}'.format(x) + 'M' for x in ax.get_yticks()/1000000],fontsize=15)
plt.legend([],[], frameon=False)
plt.show();

fig, ax=plt.subplots()
sns.scatterplot(data = df, x = "grade", y = "price", hue = "grade", s=150, palette="colorblind")
plt.title('House prices by grade', fontsize=16)
plt.xlabel('Grade', fontsize=16)
plt.ylabel('Price ($)', fontsize=16)
plt.legend([],[], frameon=False)
plt.xticks(range(1,14), fontsize=16)
ax.set_yticklabels(['{:,.0f}'.format(x) + 'M' for x in ax.get_yticks()/1000000],fontsize=15)
plt.show();

Below we examine the distribution of house prices below the 90th percentile. For a given house condition or house grade, there is a large variability in the house prices. For example, for condition = 3, the house price ranges from 100k to 900k.

In [None]:
fig, ax=plt.subplots(figsize=(7.5,6))
sns.set_theme(style="white")
sns.scatterplot(data = df_90th, x = "condition", y = "price", hue = "condition", s=150, palette="colorblind")
plt.title('House prices by condition', fontsize=20)
plt.xlabel('Condition', fontsize=16)
plt.ylabel('Price ($)', fontsize=16)
plt.xticks(range(1,6), fontsize=16)
ax.set_yticklabels(['{:,.0f}'.format(x) + 'K' for x in ax.get_yticks()/1000],fontsize=15)
plt.legend([],[], frameon=False)
plt.show();

fig, ax=plt.subplots(figsize=(7.5,6))
sns.scatterplot(data = df_90th, x = "grade", y = "price", hue = "grade", s=150, palette="colorblind")
plt.title('House prices by grade', fontsize=20)
plt.xlabel('Grade', fontsize=16)
plt.ylabel('Price ($)', fontsize=16)
plt.legend([],[], frameon=False)
plt.xticks(range(1,14), fontsize=16)
ax.set_yticklabels(['{:,.0f}'.format(x) + 'K' for x in ax.get_yticks()/1000],fontsize=15)
plt.show();

We now look at box plots of condition vs price and grade vs price. A big increase in the median house price can be observed going from condition 2 to 3. In general, the median house price also increases with condition.

In [None]:
sns.boxplot( y=df_90th["price"] , x=df_90th["condition"]);
plt.show();

sns.boxplot( y=df_90th["price"] , x=df_90th["grade"]);
plt.show();

Instead of presenting boxplots to our stakeholder, I chose to produce the median, 25th percentile and 75th percentile plots of the house price vs condition and grade below:

In [None]:
cond_price = df_90th.groupby('condition')['price'].mean().reset_index()
cond_price.rename(columns={'price':'mean_price'}, inplace=True)
cond_price['median'] = df_90th.groupby('condition')['price'].median().reset_index()['price']
cond_price['quantile0.75'] = df_90th.groupby('condition')['price'].quantile(0.75).reset_index()['price']
cond_price['quantile0.25'] = df_90th.groupby('condition')['price'].quantile(0.25).reset_index()['price']

grade_price = df_90th.groupby('grade')['price'].mean().reset_index()
grade_price.rename(columns={'price':'mean_price'}, inplace=True)
grade_price['median'] = df_90th.groupby('grade')['price'].median().reset_index()['price']
grade_price['quantile0.75'] = df_90th.groupby('grade')['price'].quantile(0.75).reset_index()['price']
grade_price['quantile0.25'] = df_90th.groupby('grade')['price'].quantile(0.25).reset_index()['price']

fig, ax=plt.subplots(figsize=(7.5,6))
#ax.plot(cond_price['condition'], cond_price["mean_price"], marker='o')
ax.plot(cond_price['condition'], cond_price['median'], marker='o', label="Median")
ax.plot(cond_price['condition'], cond_price["quantile0.25"], marker='o', label="25th percentile")
ax.plot(cond_price['condition'], cond_price["quantile0.75"], marker='o', label="75th percentile")
plt.title('House price by condition', fontsize=20)
plt.xlabel('Condition', fontsize=16)
plt.ylabel('Price ($)', fontsize=16)
plt.legend(loc=2, prop={'size': 16})
plt.xticks(range(1,6), fontsize=16)
plt.yticks(list(np.array(range(0,8))*100000),['0', '100K', '200K', '300K', '400K', '500K','600K','700K'], fontsize=16)
plt.show();

fig, ax=plt.subplots(figsize=(7.5,6))
#ax.plot(grade_price['grade'], grade_price["mean_price"], marker='o')
ax.plot(grade_price['grade'], grade_price["median"], marker='o', label="Median")
ax.plot(grade_price['grade'], grade_price["quantile0.25"], marker='o', label="25th percentile")
ax.plot(grade_price['grade'], grade_price["quantile0.75"], marker='o', label="75th percentile")
plt.title('House price by grade', fontsize=20)
plt.xlabel('Grade', fontsize=16)
plt.ylabel('Price ($)', fontsize=16)
plt.legend(loc=2, prop={'size': 16})
plt.xticks(range(1,14), fontsize=16)
ax.set_yticklabels(['{:,.0f}'.format(x) + 'K' for x in ax.get_yticks()/1000], fontsize=15)
plt.show();
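
The per-group median and quartiles can also be computed in one pass rather than with one groupby call per statistic. A minimal sketch on made-up data, where the `toy` frame and the column names `q25` and `q75` are chosen for this illustration only:

```python
import pandas as pd

# Hypothetical prices for two conditions
toy = pd.DataFrame({
    "condition": [2, 2, 2, 3, 3, 3, 3],
    "price": [150_000, 200_000, 250_000, 300_000, 350_000, 400_000, 450_000],
})

# One groupby call yields all three quantiles; unstack turns them into columns
quartiles = toy.groupby("condition")["price"].quantile([0.25, 0.5, 0.75]).unstack()
quartiles.columns = ["q25", "median", "q75"]
print(quartiles)
```

The resulting frame has one row per condition, so each column can be passed directly to ax.plot in the same way as the columns built above.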

The plots above indicate that the median house price increases significantly from condition 2 to 3. An increase in house price with grade is also observed. Seeing these plots made me interested in the price increase gained by upgrading a house from its current condition or grade to the next one up. The biggest change in median price occurs when going from condition 2 to 3 or from grade 6 to 7.

In [None]:
# Calculate difference in price and percentage change to next condition
# shift(-1) pairs each row with the median of the next condition; the last row stays NaN
cond_price['diff'] = cond_price['median'].shift(-1) - cond_price['median']
cond_price['percent_diff'] = 100.0 * cond_price['diff'] / cond_price['median']

fig, ax=plt.subplots(figsize=(7.5,6))
plt.plot(cond_price['condition'], cond_price['percent_diff'], linestyle='--', marker='o', color="red")
plt.title('% change in house price to next condition', fontsize=20)
plt.xticks(range(1,5),fontsize=16)
ax.set_xticklabels(['1 to 2', '2 to 3', '3 to 4', '4 to 5'])
plt.yticks(fontsize=16)
plt.xlabel('Increase in condition', fontsize=16)
plt.ylabel('% Change', fontsize=16)
plt.show();

# Calculate difference in price and percentage change to next grade
# shift(-1) pairs each row with the median of the next grade; the last row stays NaN
grade_price['diff'] = grade_price['median'].shift(-1) - grade_price['median']
grade_price['percent_diff'] = 100.0 * grade_price['diff'] / grade_price['median']

fig, ax=plt.subplots(figsize=(7.5,6))
plt.plot(grade_price['grade'], grade_price['percent_diff'], linestyle='--', marker='o', color="blue")
plt.title('% change in house price to next grade', fontsize=20)
plt.xlim((1,13))
plt.xticks(range(1,13), ['1 to 2', '2 to 3', '3 to 4', '4 to 5', '5 to 6', '6 to 7', '7 to 8', '8 to 9', '9 to 10', '10 to 11', '11 to 12', '12 to 13'], fontsize=16, rotation=90)
plt.xlabel('Increase in grade', fontsize=16)
plt.ylabel('% Change', fontsize=16)
plt.yticks(fontsize=16)
plt.show();

I did some calculations to see how the average % profit can be increased for different combinations of conditions and grade. In general, an average increase in the median price of about 50% can be achieved by upgrading the house from condition 2 to 3 and from grade 6 to 7.

In [None]:
# Average increase from condition 2 to 3 at grade 6
df26 = df_90th[(df_90th['condition'] == 2) & (df_90th['grade'] == 6)]
df36 = df_90th[(df_90th['condition'] == 3) & (df_90th['grade'] == 6)]
100*(df36.price.mean()-df26.price.mean())/df26.price.mean()
100*(df36.price.median()-df26.price.median())/df26.price.median()

# # Average increase from condition 2 to 3 at grade 7
# df27 = df_90th[(df_90th['condition'] == 2) & (df_90th['grade'] == 7)]
# df37 = df_90th[(df_90th['condition'] == 3) & (df_90th['grade'] == 7)]
# 100*(df37.price.mean()-df27.price.mean())/df27.price.mean()
# 100*(df37.price.median()-df27.price.median())/df27.price.median()

# # Average increase from condition 2 to 3 at grade 8
# df28 = df_90th[(df_90th['condition'] == 2) & (df_90th['grade'] == 8)]
# df38 = df_90th[(df_90th['condition'] == 3) & (df_90th['grade'] == 8)]
# 100*(df38.price.mean()-df28.price.mean())/df28.price.mean()

# Average increase from condition 2 to 3 and grade 6 to 7
df1 = df_90th[(df_90th['condition'] == 2) & (df_90th['grade'] == 6)]
df2 = df_90th[(df_90th['condition'] == 3) & (df_90th['grade'] == 7)]
100*(df2.price.mean()-df1.price.mean())/df1.price.mean()
100*(df2.price.median()-df1.price.median())/df1.price.median()

# Average increase from condition 2 to 3 and grade 7 to 8
df1 = df_90th[(df_90th['condition'] == 2) & (df_90th['grade'] == 7)]
df2 = df_90th[(df_90th['condition'] == 3) & (df_90th['grade'] == 8)]
100*(df2.price.mean()-df1.price.mean())/df1.price.mean()
100*(df2.price.median()-df1.price.median())/df1.price.median()
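
The pairwise calculations above can be generalised with a pivot table of median prices by condition and grade. The following is a sketch on made-up numbers; `toy`, `medians` and the helper `pct_uplift` are hypothetical names introduced for this illustration:

```python
import pandas as pd

# Hypothetical prices for each (condition, grade) combination
toy = pd.DataFrame({
    "condition": [2, 2, 2, 2, 3, 3, 3, 3],
    "grade":     [6, 6, 7, 7, 6, 6, 7, 7],
    "price":     [180_000, 220_000, 250_000, 350_000,
                  240_000, 260_000, 380_000, 420_000],
})

# Rows are conditions, columns are grades, cells are median prices
medians = toy.pivot_table(index="condition", columns="grade",
                          values="price", aggfunc="median")

def pct_uplift(table, start, end):
    """Percent change in median price from (condition, grade) start to end."""
    before = table.loc[start[0], start[1]]
    after = table.loc[end[0], end[1]]
    return 100.0 * (after - before) / before

uplift_condition_only = pct_uplift(medians, (2, 6), (3, 6))
uplift_both = pct_uplift(medians, (2, 6), (3, 7))
print(uplift_condition_only, uplift_both)
```

These figures compare the medians of different groups of houses, not the same house before and after a renovation, so they are better read as an upper-level indication than as a guaranteed return.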




## Filtering for potential houses for our stakeholder

Plotting all house prices below the 90th percentile on a map, we observe that the areas with the highest house prices were Sammamish, Redmond and Bellevue, which are also ranked highly by niche.com.

In [None]:
# Plotting the house prices below the 90th percentile
fig = px.scatter_mapbox(df_90th, lat="lat", lon="long", color="price", hover_name="zipcode", center=dict(lat=47.48, lon=-122.09), \
    opacity=1, zoom=8.4, width=600, height=600)
fig.update_layout(mapbox_style="open-street-map")

I filtered the data based on these conditions:
1. Houses built in or before 1972
2. Houses with condition of 2
3. Houses with a grade of 6 or 7
4. Houses where the difference between sqft_living and sqft_living15 was less than 100 sq ft.
5. Houses with zip codes belonging to this list of top areas in King County: [Niche](https://www.niche.com/places-to-live/search/best-places-to-live/c/king-county-wa/)

The results were then sorted to find the 5 cheapest houses that meet those criteria. The goal was to find the cheapest houses, and therefore the most room for profit, for a condition of 2 and a grade of 6 or 7.

In [None]:
# Build every mask from df_90th so the boolean index aligns with the frame being filtered
df_cond = df_90th[(df_90th['diff_sqft_living'] < 100) & (df_90th['yr_built'] <= 1972) & (df_90th['condition'] == 2) & (df_90th['grade'].isin([6, 7]))]

nice_zip_codes = [98029, 98074, 98075, 98052, 98004, 98005, 98006, 98007, 98008, 98009, 98015, 98052, 98053, 98007, \
                98073, 98101, 98102, 98103, 98104, 98105, 98106, 98107, 98108, 98109, 98112, 98115, 98116, 98117, \
                98118, 98119, 98121, 98122, 98125, 98126, 98133, 98134, 98136, 98144, 98146, 98154, 98164, 98174, \
                98177, 98178, 98195, 98199]

# Top five candidates by price
df_final = df_cond[df_cond['zipcode'].isin(nice_zip_codes)].sort_values('price',ascending=True).head(5)
df_final[['grade','zipcode','price','yr_built','bedrooms','bathrooms','sqft_living','sqft_lot']]
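
The 1972 cutoff used in the filter follows from the assumption that historical homes are at least 50 years old. Below is a minimal sketch of deriving that cutoff instead of hard-coding it. `REFERENCE_YEAR = 2022` is an assumption inferred from 1972 + 50 and is not stated in the notebook; measuring age against the year of each sale would be an alternative:

```python
import pandas as pd

MIN_AGE_YEARS = 50      # from the assumption that historical homes are at least 50 years old
REFERENCE_YEAR = 2022   # assumption: chosen so that the cutoff matches 1972

# Hypothetical build years
toy = pd.DataFrame({"yr_built": [1925, 1972, 1973, 2001]})

# Age relative to the reference year, then keep the houses that are old enough
toy["age"] = REFERENCE_YEAR - toy["yr_built"]
historical = toy[toy["age"] >= MIN_AGE_YEARS]
print(historical["yr_built"].tolist())
```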

### Top 5 candidates for our stakeholder

We want the houses that give us the most value for money, given condition = 2 and grade = 6 or 7. Plotting the top five candidate houses by price:

In [None]:
fig = px.scatter_mapbox(df_final, lat="lat", lon="long", color="price", size="price", hover_name="zipcode", opacity=1, zoom=8, width=600, height=600)
fig.update_layout(mapbox_style="open-street-map")
fig.show();

## Top 2 recommendations for the stakeholder

Using Google Maps Street View, the top 5 candidates were narrowed down further to these final two:

In [None]:
df_final.loc[[242,702]]

### Investigating correlations in the data:

Plotting a scatter plot of the diff_sqft_living vs price, it appears there is no obvious correlation between the two columns.

In [None]:
sns.scatterplot(x="diff_sqft_living", y="price", data=df_90th)
plt.show();
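
The visual impression can be backed up with a number. The sketch below computes the Pearson correlation and a rank-based (Spearman-style) correlation on made-up values; the `toy` frame is hypothetical, and the rank correlation is obtained by correlating the ranks so that only pandas is needed:

```python
import pandas as pd

# Hypothetical size differences and prices
toy = pd.DataFrame({
    "diff_sqft_living": [50, 400, 120, 900, 10, 300],
    "price": [320_000, 410_000, 280_000, 390_000, 450_000, 300_000],
})

# Pearson measures linear association
pearson = toy["diff_sqft_living"].corr(toy["price"])

# Pearson on the ranks equals the Spearman rank correlation
spearman = toy["diff_sqft_living"].rank().corr(toy["price"].rank())

print(round(pearson, 3), round(spearman, 3))
```

Values close to 0 for both coefficients would support the reading that the two columns are not meaningfully related.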

Building a correlation matrix:

In [None]:
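df_corr = df_90th.drop(['lat','long'], axis=1)
# Keep numeric and boolean columns only (date and month cannot be correlated)
corr = df_corr.select_dtypes(include=['number', 'bool']).corr().abs()
# np.bool was removed from NumPy, so the built-in bool is used for the mask
mask = np.triu(np.ones_like(corr, dtype=bool))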
fig, ax=plt.subplots(figsize=(17,12))
fig.suptitle('Variable Correlations', fontsize=30, y=.95, fontname='Silom')
heatmap = sns.heatmap(corr, cmap='Reds', mask=mask, annot=True)
plt.show();
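
A heatmap with this many columns can be hard to read for a single target. As a complementary sketch, the correlations with price can be pulled out and ranked. The `toy` frame and its values are made up, and `ranked` is an illustrative name:

```python
import pandas as pd

# Hypothetical rows; month is a string column and must be excluded
toy = pd.DataFrame({
    "price":       [200_000, 300_000, 400_000, 500_000, 600_000],
    "sqft_living": [900, 1400, 1800, 2600, 3000],
    "grade":       [6, 6, 7, 8, 8],
    "yr_built":    [1950, 1930, 1965, 1940, 1960],
    "month":       ["Jan", "Feb", "Mar", "Apr", "May"],
})

# Correlate numeric columns only, then rank them by absolute correlation with price
corr = toy.select_dtypes(include="number").corr()
ranked = corr["price"].drop("price").abs().sort_values(ascending=False)
print(ranked)
```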