### **Data Analysis and Manipulation**

We will use the "House Sales in King County, USA" dataset from Kaggle.
<br>Kaggle dataset link:
<br>https://www.kaggle.com/datasets/harlfoxem/housesalesprediction

In [1]:
import pandas as pd
from IPython.display import display

In [2]:
df_houses = pd.read_csv("kc_house_data _rev.csv")

#### **Setting a column as the index**

In [None]:
with pd.option_context('display.max_columns', None):
    display(df_houses.head())

Setting the "yr_built" column as the index:

In [None]:
new_df_index = df_houses.set_index("yr_built")

new_df_index.head()

To remove the index:

In [None]:
new_data = new_df_index.reset_index()

new_data.head()
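As a minimal sketch of what "set_index()" and "reset_index()" do to the columns, the following uses a small made-up table (the `toy` DataFrame and its values are illustrative assumptions, not rows from the Kaggle file):

```python
import pandas as pd

# Hypothetical toy data standing in for the house sales table.
toy = pd.DataFrame({
    "id": [101, 102, 103],
    "yr_built": [1955, 1987, 1955],
    "price": [221900.0, 510000.0, 300000.0],
})

# "yr_built" leaves the columns and becomes the index.
indexed = toy.set_index("yr_built")

# reset_index() moves the index back into the columns (as the first column).
restored = indexed.reset_index()

# drop=True discards the index values instead of restoring them.
dropped = indexed.reset_index(drop=True)

print(list(indexed.columns))
print(list(restored.columns))
print(list(dropped.columns))
```

Both methods return a new DataFrame by default, so `toy` itself is left unchanged.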

Without an index, subsetting by value requires a boolean mask:

In [None]:
year_built_bool = df_houses["yr_built"].isin([1955, 1987])
year_built_55_87 = df_houses[year_built_bool]

year_built_55_87

With "yr_built" as the index, the same subset is a single ".loc" lookup:

In [None]:
# Create a new DataFrame indexed by "yr_built", then select rows by label:
new_df_index = df_houses.set_index("yr_built")

new_df_index.loc[[1955, 1987]]

Note that index values do not need to be unique: many houses share the same "yr_built" value.
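A small sketch of how a non-unique index behaves, again on made-up values (the `toy` frame is an assumption for illustration): a duplicated label returns every matching row, and "is_unique" reports whether any label repeats.

```python
import pandas as pd

# Hypothetical toy data with the label 1955 appearing twice.
toy = pd.DataFrame(
    {"price": [221900.0, 510000.0, 300000.0]},
    index=pd.Index([1955, 1987, 1955], name="yr_built"),
)

# False, because 1955 is repeated.
print(toy.index.is_unique)

# A duplicated label returns all matching rows as a DataFrame.
rows_1955 = toy.loc[1955]
print(rows_1955)

# List the labels that occur more than once.
duplicated_labels = toy.index[toy.index.duplicated()].unique()
print(list(duplicated_labels))
```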

In [None]:
# Also, we can use the "sort_index()" method to sort the index:
sort_new_df = new_df_index.sort_index()

sort_new_df

Note: before setting a column as the index, it is recommended to drop the rows with missing values (NaN) in that column, because they may raise exceptions, especially with multi-level indexing.

In [None]:
# On a sorted index, you can slice a range of labels (both endpoints are included):
sort_new_df.loc[1955:1956]
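The note and the slice above can be combined into one hedged sketch. The toy values are assumptions; "dropna(subset=...)" limits the NaN check to the column that will become the index:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing "yr_built" value.
toy = pd.DataFrame({
    "yr_built": [1956.0, np.nan, 1955.0, 1987.0],
    "price": [100.0, 200.0, 300.0, 400.0],
})

# Drop only the rows where the future index column is missing.
clean = toy.dropna(subset=["yr_built"])

# Set the index and sort it so that label slicing is well defined.
by_year = clean.set_index("yr_built").sort_index()

# Label slices with .loc include both endpoints (1955 and 1956).
window = by_year.loc[1955:1956]
print(window)
```

Unlike positional slicing with ".iloc", the upper bound of a ".loc" slice is part of the result.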

#### **Multi-level indexing**

In [None]:
# Multi-level indexes work best without missing values, so first remove the rows that contain "NaN".
# "dropna()" returns a new DataFrame and leaves "df_houses" unchanged:
df_houses_copy = df_houses.dropna()

In [None]:
mutlti_index = df_houses_copy.set_index(["yr_built", "grade"])

mutlti_index.head()

In [None]:
# Subset by outer-level values (all grades for these years):
mutlti_index.loc[[1933, 1995]]

In [None]:
# Subset by (outer level, inner level) tuples:
mutlti_index.loc[[(1933, 8), (1955, 9)]]
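To make the difference between the two lookups concrete, here is a sketch on a tiny made-up frame (the `toy` data and variable names are assumptions). It also shows "xs()", a pandas method for selecting by the inner level alone:

```python
import pandas as pd

# Hypothetical toy data with two index levels.
toy = pd.DataFrame({
    "yr_built": [1933, 1933, 1955, 1995],
    "grade": [6, 8, 9, 7],
    "price": [180000.0, 250000.0, 400000.0, 350000.0],
})
multi = toy.set_index(["yr_built", "grade"]).sort_index()

# A list of outer-level labels keeps every grade for those years.
outer = multi.loc[[1933, 1995]]

# A list of tuples selects exact (yr_built, grade) combinations.
pairs = multi.loc[[(1933, 8), (1955, 9)]]

# xs() takes a cross-section on the inner level only.
grade_8 = multi.xs(8, level="grade")

print(outer)
print(pairs)
print(grade_8)
```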

Also, we can use the "sort_index()" method with multi-level indexing:

In [None]:
sort_df_index = mutlti_index.sort_index()
# sort_df_index = mutlti_index.sort_index(ascending=False)

sort_df_index

In [None]:
# To slice a multi-level index, the DataFrame must be sorted by its index; otherwise, "UnsortedIndexError" is raised:

sort_df_index.loc[1910:1915]
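The following sketch reproduces the error on a deliberately unsorted toy index and then fixes it with "sort_index()". The data is made up, and the exact error message may differ between pandas versions:

```python
import pandas as pd
from pandas.errors import UnsortedIndexError

# Hypothetical toy data whose rows are not in index order.
toy = pd.DataFrame({
    "yr_built": [1915, 1910, 1912, 1910],
    "grade": [7, 9, 5, 6],
    "price": [1.0, 2.0, 3.0, 4.0],
})
unsorted = toy.set_index(["yr_built", "grade"])

# Slicing the unsorted multi-level index is expected to fail.
try:
    unsorted.loc[1910:1915]
    slice_failed = False
except UnsortedIndexError as err:
    slice_failed = True
    print("Slicing failed:", err)

# After sorting, the same slice works.
sorted_df = unsorted.sort_index()
result = sorted_df.loc[1910:1915]
print(result)
```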

In [None]:
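# Slice from (yr_built 1910, grade 5) through (yr_built 1915, grade 9).
# Tuples are compared in sorted (lexicographic) order, so the grade bounds apply
# only to the first and last year; every grade is kept for the years in between: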
sort_df_index.loc[(1910, 5):(1915, 9)]

In [None]:
# Slicing with specific columns:
sort_df_index.loc[(1910, 5):(1915, 9), "id":"price"]
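A tuple slice walks the sorted index from the start tuple to the end tuple, so it does not restrict the inner level for the years in between. If the goal is to bound both levels independently, one option is "pd.IndexSlice". The sketch below compares the two on made-up data (the `toy` frame is an assumption):

```python
import pandas as pd

# Hypothetical toy data; grades 4, 10 and 11 fall outside the 5-9 range.
toy = pd.DataFrame({
    "yr_built": [1910, 1910, 1912, 1912, 1915, 1915],
    "grade": [5, 11, 4, 7, 9, 10],
    "price": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})
multi = toy.set_index(["yr_built", "grade"]).sort_index()

# Tuple slice: everything between (1910, 5) and (1915, 9) in sorted order.
lexicographic = multi.loc[(1910, 5):(1915, 9)]

# IndexSlice: years 1910-1915 AND grades 5-9, applied level by level.
idx = pd.IndexSlice
per_level = multi.loc[idx[1910:1915, 5:9], :]

print(lexicographic)
print(per_level)
```

In this toy example the tuple slice keeps the rows (1910, 11) and (1912, 4), while the "IndexSlice" version drops them.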

#### **Manipulating Datetime data**

In [3]:
display(df_houses.head())

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.0,1180.0,5650.0,1.0,0.0,0.0,...,7,1180.0,0.0,1955.0,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570.0,7242.0,2.0,0.0,0.0,...,7,2170.0,400.0,1951.0,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.0,770.0,10000.0,1.0,0.0,0.0,...,6,770.0,0.0,1933.0,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.0,1960.0,5000.0,1.0,0.0,0.0,...,7,1050.0,910.0,1965.0,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.0,1680.0,8080.0,1.0,0.0,0.0,...,8,1680.0,0.0,1987.0,0,98074,47.6168,-122.045,1800,7503


As shown in the DataFrame above, there is a column called "date":

In [4]:
# The "date" column has the object dtype:
df_houses["date"].dtypes

dtype('O')

It has the object dtype because the dates are stored as strings.
<br>To convert it to datetime:

In [5]:
df_houses["date"] = pd.to_datetime(df_houses["date"])
df_houses["date"].dtypes

dtype('<M8[ns]')
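"pd.to_datetime()" infers the layout of the strings by default. As a hedged alternative, the format can be stated explicitly, which avoids ambiguity and lets invalid entries be turned into "NaT" instead of raising. The sample strings below mimic the "date" column, and the "not a date" entry is an invented example:

```python
import pandas as pd

# Sample strings in the same layout as the "date" column, plus one bad value.
raw = pd.Series(["20141013T000000", "20141209T000000", "not a date"])

# An explicit format documents the expected layout;
# errors="coerce" converts unparseable entries to NaT.
parsed = pd.to_datetime(raw, format="%Y%m%dT%H%M%S", errors="coerce")

print(parsed)
print(parsed.isna().sum())
```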

In [6]:
# As shown below, "to_datetime()" converted the "date" strings to readable dates:
display(df_houses.head())

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,2014-10-13,221900.0,3,1.0,1180.0,5650.0,1.0,0.0,0.0,...,7,1180.0,0.0,1955.0,0,98178,47.5112,-122.257,1340,5650
1,6414100192,2014-12-09,538000.0,3,2.25,2570.0,7242.0,2.0,0.0,0.0,...,7,2170.0,400.0,1951.0,1991,98125,47.721,-122.319,1690,7639
2,5631500400,2015-02-25,180000.0,2,1.0,770.0,10000.0,1.0,0.0,0.0,...,6,770.0,0.0,1933.0,0,98028,47.7379,-122.233,2720,8062
3,2487200875,2014-12-09,604000.0,4,3.0,1960.0,5000.0,1.0,0.0,0.0,...,7,1050.0,910.0,1965.0,0,98136,47.5208,-122.393,1360,5000
4,1954400510,2015-02-18,510000.0,3,2.0,1680.0,8080.0,1.0,0.0,0.0,...,8,1680.0,0.0,1987.0,0,98074,47.6168,-122.045,1800,7503


"<M8[ns]" is NumPy's type code for datetime64 values with nanosecond precision; pandas also displays this dtype as "datetime64[ns]".
<br>To extract the year, month, day, weekday name, and time:

In [7]:
df_houses["year"] = df_houses["date"].dt.year
df_houses["month"] = df_houses["date"].dt.month
df_houses["day"] = df_houses["date"].dt.day
df_houses["weekday"] = df_houses["date"].dt.day_name()
df_houses["time"] = df_houses["date"].dt.time

display(df_houses.head())

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,zipcode,lat,long,sqft_living15,sqft_lot15,year,month,day,weekday,time
0,7129300520,2014-10-13,221900.0,3,1.0,1180.0,5650.0,1.0,0.0,0.0,...,98178,47.5112,-122.257,1340,5650,2014,10,13,Monday,00:00:00
1,6414100192,2014-12-09,538000.0,3,2.25,2570.0,7242.0,2.0,0.0,0.0,...,98125,47.721,-122.319,1690,7639,2014,12,9,Tuesday,00:00:00
2,5631500400,2015-02-25,180000.0,2,1.0,770.0,10000.0,1.0,0.0,0.0,...,98028,47.7379,-122.233,2720,8062,2015,2,25,Wednesday,00:00:00
3,2487200875,2014-12-09,604000.0,4,3.0,1960.0,5000.0,1.0,0.0,0.0,...,98136,47.5208,-122.393,1360,5000,2014,12,9,Tuesday,00:00:00
4,1954400510,2015-02-18,510000.0,3,2.0,1680.0,8080.0,1.0,0.0,0.0,...,98074,47.6168,-122.045,1800,7503,2015,2,18,Wednesday,00:00:00


To get the hour, minute, and second:
- **Hour:** df_houses["date"].dt.hour
- **Minute:** df_houses["date"].dt.minute
- **Second:** df_houses["date"].dt.second
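Because every timestamp in this dataset is at midnight, these three accessors return 0 for all rows. The sketch below uses two made-up timestamps with a time component to show what they return in general:

```python
import pandas as pd

# Hypothetical timestamps that include a time of day.
stamps = pd.Series(pd.to_datetime(["2014-10-13 08:15:30", "2015-02-25 17:45:05"]))

# Each .dt accessor returns one integer per row.
parts = pd.DataFrame({
    "hour": stamps.dt.hour,
    "minute": stamps.dt.minute,
    "second": stamps.dt.second,
})

print(parts)
```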

In [8]:
# Keep only the sales made after December 1, 2014 (the string is compared as a date):
df_houses_date_filter = df_houses[df_houses["date"] > "2014-12-01"]
display(df_houses_date_filter)

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,zipcode,lat,long,sqft_living15,sqft_lot15,year,month,day,weekday,time
1,6414100192,2014-12-09,538000.0,3,2.25,2570.0,7242.0,2.0,0.0,0.0,...,98125,47.7210,-122.319,1690,7639,2014,12,9,Tuesday,00:00:00
2,5631500400,2015-02-25,180000.0,2,1.00,770.0,10000.0,1.0,0.0,0.0,...,98028,47.7379,-122.233,2720,8062,2015,2,25,Wednesday,00:00:00
3,2487200875,2014-12-09,604000.0,4,3.00,1960.0,5000.0,1.0,0.0,0.0,...,98136,47.5208,-122.393,1360,5000,2014,12,9,Tuesday,00:00:00
4,1954400510,2015-02-18,510000.0,3,2.00,1680.0,8080.0,1.0,0.0,0.0,...,98074,47.6168,-122.045,1800,7503,2015,2,18,Wednesday,00:00:00
7,2008000270,2015-01-15,291850.0,3,1.50,1060.0,9711.0,1.0,0.0,0.0,...,98198,47.4095,-122.315,1650,9711,2015,1,15,Thursday,00:00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21604,9834201367,2015-01-26,429000.0,3,2.00,1490.0,1126.0,3.0,0.0,0.0,...,98144,47.5699,-122.288,1400,1230,2015,1,26,Monday,00:00:00
21606,7936000429,2015-03-26,1010000.0,4,3.50,3510.0,7200.0,2.0,0.0,0.0,...,98136,47.5537,-122.398,2050,6200,2015,3,26,Thursday,00:00:00
21607,2997800021,2015-02-19,475000.0,3,2.50,1310.0,1294.0,2.0,0.0,0.0,...,98116,47.5773,-122.409,1330,1265,2015,2,19,Thursday,00:00:00
21609,6600060120,2015-02-23,400000.0,4,2.50,2310.0,5813.0,2.0,0.0,0.0,...,98146,47.5107,-122.362,1830,7200,2015,2,23,Monday,00:00:00
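The comparison above is strict, so sales dated exactly 2014-12-01 are excluded. As a sketch on made-up rows (the `sales` frame is an assumption), "between()" selects a closed date range with both endpoints included:

```python
import pandas as pd

# Hypothetical toy sales with one row exactly on the boundary date.
sales = pd.DataFrame({
    "date": pd.to_datetime(["2014-10-13", "2014-12-01", "2014-12-09", "2015-02-25"]),
    "price": [221900.0, 300000.0, 538000.0, 180000.0],
})

# Strictly after December 1, 2014: the boundary row is dropped.
after = sales[sales["date"] > "2014-12-01"]

# All of December 2014: between() includes both endpoints by default.
december = sales[sales["date"].between("2014-12-01", "2014-12-31")]

print(after)
print(december)
```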


In [9]:
# Calculate the time elapsed between each sale date and the current date and time:
df_houses["days_from_today"] = pd.to_datetime("today") - df_houses["date"]
display(df_houses.head())

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,lat,long,sqft_living15,sqft_lot15,year,month,day,weekday,time,days_from_today
0,7129300520,2014-10-13,221900.0,3,1.0,1180.0,5650.0,1.0,0.0,0.0,...,47.5112,-122.257,1340,5650,2014,10,13,Monday,00:00:00,3919 days 21:26:11.936637
1,6414100192,2014-12-09,538000.0,3,2.25,2570.0,7242.0,2.0,0.0,0.0,...,47.721,-122.319,1690,7639,2014,12,9,Tuesday,00:00:00,3862 days 21:26:11.936637
2,5631500400,2015-02-25,180000.0,2,1.0,770.0,10000.0,1.0,0.0,0.0,...,47.7379,-122.233,2720,8062,2015,2,25,Wednesday,00:00:00,3784 days 21:26:11.936637
3,2487200875,2014-12-09,604000.0,4,3.0,1960.0,5000.0,1.0,0.0,0.0,...,47.5208,-122.393,1360,5000,2014,12,9,Tuesday,00:00:00,3862 days 21:26:11.936637
4,1954400510,2015-02-18,510000.0,3,2.0,1680.0,8080.0,1.0,0.0,0.0,...,47.6168,-122.045,1800,7503,2015,2,18,Wednesday,00:00:00,3791 days 21:26:11.936637


In [10]:
# Calculate the difference between a fixed reference date and each sale date:
df_houses["days_from"] = pd.to_datetime("2015-05-01") - df_houses["date"]
display(df_houses.head())

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,long,sqft_living15,sqft_lot15,year,month,day,weekday,time,days_from_today,days_from
0,7129300520,2014-10-13,221900.0,3,1.0,1180.0,5650.0,1.0,0.0,0.0,...,-122.257,1340,5650,2014,10,13,Monday,00:00:00,3919 days 21:26:11.936637,200 days
1,6414100192,2014-12-09,538000.0,3,2.25,2570.0,7242.0,2.0,0.0,0.0,...,-122.319,1690,7639,2014,12,9,Tuesday,00:00:00,3862 days 21:26:11.936637,143 days
2,5631500400,2015-02-25,180000.0,2,1.0,770.0,10000.0,1.0,0.0,0.0,...,-122.233,2720,8062,2015,2,25,Wednesday,00:00:00,3784 days 21:26:11.936637,65 days
3,2487200875,2014-12-09,604000.0,4,3.0,1960.0,5000.0,1.0,0.0,0.0,...,-122.393,1360,5000,2014,12,9,Tuesday,00:00:00,3862 days 21:26:11.936637,143 days
4,1954400510,2015-02-18,510000.0,3,2.0,1680.0,8080.0,1.0,0.0,0.0,...,-122.045,1800,7503,2015,2,18,Wednesday,00:00:00,3791 days 21:26:11.936637,72 days
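Both new columns hold Timedelta values, and the "today" version also carries the current time of day. If plain integer day counts are preferred, a minimal sketch (on three dates copied from the output above; the variable names are assumptions) is to use ".dt.days", optionally after normalizing "today" to midnight:

```python
import pandas as pd

# Three sale dates taken from the first rows of the dataset.
dates = pd.Series(pd.to_datetime(["2014-10-13", "2014-12-09", "2015-02-25"]))
reference = pd.Timestamp("2015-05-01")

# Subtracting datetimes yields Timedelta values.
delta = reference - dates

# .dt.days extracts the whole-day component as integers.
whole_days = delta.dt.days
print(whole_days.tolist())

# normalize() sets the time to midnight, so no fractional day remains.
today_midnight = pd.Timestamp("today").normalize()
days_since = (today_midnight - dates).dt.days
print(days_since.tolist())
```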
