# Week2_2_Data Management_1

In this lab, we introduce pandas for data management. You will work with a new data type called a *DataFrame*, which is the pandas counterpart of a spreadsheet. The dataset for this lab is a sample of resale housing transactions in Beijing from 2012 to 2016. 

Specifically, we will cover the following topics in this lab:


<p><a id="table"> </a></p>
<h3 id="Table-of-Contents">Table of Contents<a class="anchor-link" href="#Table-of-Contents">¶</a></h3>
<ol>
<h3>1 <a href="#pythonpackages">Python Packages</a></h3>
</ol>
<ol>
<h3>2 <a href="#pandasbasics">Pandas basics</a></h3>
2.1 <a href="#seriesanddataFrame">Series and DataFrame</a><br/>
2.2 <a href="#importandexportdata">Import and export data</a><br/>  
2.3 <a href="#exploringourdata">Exploring our data</a><br/>
2.4 <a href="#indexingandslicingofdataFrame">Indexing and Slicing of DataFrame</a><br/>   
2.5 <a href="#filtering">Filtering</a><br/>
2.6 <a href="#modifyingelements">Modifying elements</a><br/>
2.7 <a href="#pandasarithmetics">Pandas Arithmetics</a><br/>
2.8 <a href="#groupby">Groupby function</a><br/>    
</ol>



<p><a id="Python packages"> </a></p>
<h2 id="1-Python-packages">1 <a href="#table">Python packages</a><a class="anchor-link" href="#1-python-packages">¶</a></h2>

**Built-in packages**: Python itself comes with the Python Standard Library, and Anaconda adds many pre-installed packages (also referred to as *modules* or *libraries*), including pandas, that offer ready-to-use solutions to common programming problems. Because these packages are already installed with Anaconda, they can be accessed directly with the *import* keyword. 

For today's lab, Pandas can be imported by typing `import pandas as pd` in your current script.

Here we load the package with `import` and give it the alias `pd` using `as`, so that we do not need to type `pandas` each time we use it.

In [6]:
# import pandas and name it "pd" in the current Python script
import pandas as pd

**Other Packages**: 
- Some packages are not installed with Anaconda, and we need to *install* them before importing them into the current script. One example is geopandas.

- **A Python virtual environment** is a tool for dependency management and project isolation. It allows packages to be installed locally in an isolated directory for a particular project, as opposed to being installed globally (i.e. as part of a system-wide Python).
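Before importing an optional package, it can help to check whether it is available in the current environment. The sketch below is one way to do that, using `importlib.util.find_spec` from the standard library; the helper name `is_installed` is made up for this illustration.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the package can be imported in the current environment."""
    return importlib.util.find_spec(package_name) is not None


# "os" ships with Python itself, so it is always available
print(is_installed("os"))

# geopandas is only found after it has been installed in this environment
print(is_installed("geopandas"))
```

If the second line prints `False`, the package has to be installed first (for example with `pip install geopandas`) before `import geopandas` will work.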


<p><a id="pandasbasics"> </a></p>
<h2 id="2pandasbasics">2 <a href="#table">Pandas basics</a><a class="anchor-link" href="#2Pandasbasics">¶</a></h2>

- [Pandas](http://pandas.pydata.org/) is a widely used Python library for data analysis. 

### Easy-to-use data structures
- In pandas, data is typically stored in a data structure called a DataFrame, which looks like a table with rows and columns (plus row indices and column names). Different columns can hold different data types, much like a worksheet in Excel. With this structure, we can do arithmetic on columns, select rows and columns, and add new rows and columns.


### Combines functionalities from many Python modules
- Pandas is built on top of another package, NumPy, which provides multi-dimensional arrays and mathematical functions for scientific computing. Pandas is much more than an easier-to-use NumPy: it also integrates functionality from other Python libraries such as [matplotlib (plotting)](https://matplotlib.org/) and [scipy (mathematics, science, engineering)](https://www.scipy.org/). Thus, you can use many features of those packages through pandas without importing them yourself.

<p><a id="seriesanddataFrame"> </a></p>
<h3 id="seriesanddataFrame">2.1 <a href="#table">Series and DataFrame</a><a class="anchor-link" href="#seriesanddataFrame">¶</a></h3>

The primary objects (data types) in pandas are:

- ***Series***: a single column of data (one dimension). You can think of a *Series* object as a fancier version of a *list*.


- ***DataFrame***: tabular data (as in a table) in which each column is a pandas Series. It is the central data structure used in most analyses with the pandas library.


- A Series can be created from a list using **pd.Series(list)**.

In [28]:
# a Series can be created from a list or a dictionary
# using list:
city_name = pd.Series(["New York", "Los Angeles", "Chicago"])
city_name

0       New York
1    Los Angeles
2        Chicago
dtype: object
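The cell above uses a list. As a minimal sketch of the dictionary route mentioned in its comment, and of how columns combine into a DataFrame, consider the following. The variable names and values are made up for illustration and are not taken from the housing dataset.

```python
import pandas as pd

# a Series built from a dictionary: the keys become the index labels
room_counts = pd.Series({"Bedroom": 2, "Livingroom": 1, "Bathroom": 1})

# a DataFrame built from a dictionary of lists: each key becomes a column
houses = pd.DataFrame({
    "HouseID": ["A001", "A002", "A003"],   # made-up identifiers
    "Size": [60.0, 85.5, 120.0],           # made-up sizes
})

print(room_counts["Bathroom"])   # look up a value by its index label
print(type(houses["HouseID"]))   # each column of a DataFrame is a Series
print(houses.shape)              # (number of rows, number of columns)
```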

<p><a id="importandexportdata"> </a></p>
<h3 id="importandexportdata">2.2 <a href="#table">Import and export data</a><a class="anchor-link" href="#importandexportdata">¶</a></h3>

- Pandas can read and write datasets in many data formats. To check whether pandas supports your data format, see the [pandas IO tools guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html). 


Let us take an example using the 2012 Beijing housing resale dataset. The dataset was collected from a private housing agency (Lianjia) and then subsampled to 5,000 transaction records.  

Download `HouseBeijing2012.xlsx` from today's folder. We will first read this Excel file into pandas as a DataFrame.
   - function to use: `df = pd.read_excel('<file path>/HouseBeijing2012.xlsx')`
   - A **file path** is a string that tells the computer where to find a file. 
   - If you have saved `HouseBeijing2012.xlsx` in the same folder as this Jupyter notebook, then the file name `HouseBeijing2012.xlsx` is all you need to input as your path. 

- To find the folder this Jupyter notebook is running in, we can use **os.getcwd()** from the **os** package, which returns the current working directory.     

In [7]:
# get the current file path using os.getcwd function
import os
os.getcwd()

'C:\\Users\\Lwz12\\OneDrive\\桌面\\Teaching_CRP4680\\Week2'

In [8]:
# a relative file path
df_2012 = pd.read_excel("HouseBeijing2012.xlsx")
df_2012

Unnamed: 0,HouseID,CommunityID,TotalPrice,TransYear,Bedroom,Livingroom,Bathroom,Size,FloorLevel,WinSouth,...,XIAOQUWEB,SchQuality,NumSubway1km,Dist2Subway,HospQuality,Dist2Hosp,NumHosp1km,NumBus200m,Dist2CBD,Dist2Center
0,BJFT84326414,1544,1400010.56,2012,2,1,1,69.68,bottom floor,0,...,https://bj.lianjia.com/xiaoqu/1111027377493/,0,2,633.24007,9,1803.02071,0,1,9345.20091,7396.31505
1,BJCP84958845,2606,1800066.00,2012,3,2,2,129.00,,0,...,https://bj.lianjia.com/xiaoqu/1111027380050/,0,0,2284.09390,9,9154.80958,0,0,18298.50637,18632.22305
2,BJDX84905788,2264,1350038.34,2012,2,1,1,88.83,3,1,...,https://bj.lianjia.com/xiaoqu/1111027379274/,0,1,667.21572,8,11158.05983,0,4,22480.82065,20105.06770
3,BJFT00386624,3621,1800006.91,2012,2,1,1,98.69,3,0,...,https://bj.lianjia.com/xiaoqu/1111027382765/,0,1,939.29061,9,1698.79101,0,10,16309.85203,11427.48851
4,BJCY84713854,1127,1970019.58,2012,1,1,1,53.66,4,0,...,https://bj.lianjia.com/xiaoqu/1111027376538/,0,3,476.28267,9,938.35742,2,0,8105.90581,7213.87518
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,BJTJ84718789,3360,1030038.24,2012,2,1,1,81.84,bottom floor,1,...,https://bj.lianjia.com/xiaoqu/1111027382053/,0,0,1669.04965,9,948.21718,1,0,15849.26185,21363.35490
4996,BJFT84287006,736,1300028.16,2012,1,1,1,58.12,3,1,...,https://bj.lianjia.com/xiaoqu/1111027375508/,0,1,770.89444,9,777.47225,1,1,6221.44255,6278.89885
4997,BJDC84781079,200,2190056.96,2012,2,1,1,58.24,3,1,...,https://bj.lianjia.com/xiaoqu/1111027374239/,2,1,715.50723,9,2053.92372,0,0,3779.07141,3505.28158
4998,BJSJ85075781,2133,1500007.54,2012,1,1,1,63.86,2,1,...,https://bj.lianjia.com/xiaoqu/1111027378940/,0,1,863.10517,9,413.21398,2,2,21094.05154,15576.85068


Here, `HouseBeijing2012.xlsx` is our file path. 

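- The file path above is a **relative path**, meaning it is interpreted relative to the current working directory, which is usually the folder containing this notebook. If you move this notebook to a different folder without moving the data file along with it, the relative path will no longer work. 

- An **absolute path** spells out the full location of a file, starting from the drive or root folder, so it does not depend on where the notebook is located. An absolute path keeps working when the notebook moves, as long as the data file itself stays where it is.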

In [32]:
# path of the course folder
path = 'C:\\Users\\Lwz12\\OneDrive\\桌面\\Teaching_CRP4680\\Week2'

In [35]:
# an absolute file path
df_2012 = pd.read_excel(path + "\\HouseBeijing2012.xlsx")
df_2012

Unnamed: 0,HouseID,CommunityID,TotalPrice,TransYear,Bedroom,Livingroom,Bathroom,Size,FloorLevel,WinSouth,...,XIAOQUWEB,SchQuality,NumSubway1km,Dist2Subway,HospQuality,Dist2Hosp,NumHosp1km,NumBus200m,Dist2CBD,Dist2Center
0,BJFT84326414,1544,1400010.56,2012,2,1,1,69.68,bottom floor,0,...,https://bj.lianjia.com/xiaoqu/1111027377493/,0,2,633.24007,9,1803.02071,0,1,9345.20091,7396.31505
1,BJCP84958845,2606,1800066.00,2012,3,2,2,129.00,,0,...,https://bj.lianjia.com/xiaoqu/1111027380050/,0,0,2284.09390,9,9154.80958,0,0,18298.50637,18632.22305
2,BJDX84905788,2264,1350038.34,2012,2,1,1,88.83,3,1,...,https://bj.lianjia.com/xiaoqu/1111027379274/,0,1,667.21572,8,11158.05983,0,4,22480.82065,20105.06770
3,BJFT00386624,3621,1800006.91,2012,2,1,1,98.69,3,0,...,https://bj.lianjia.com/xiaoqu/1111027382765/,0,1,939.29061,9,1698.79101,0,10,16309.85203,11427.48851
4,BJCY84713854,1127,1970019.58,2012,1,1,1,53.66,4,0,...,https://bj.lianjia.com/xiaoqu/1111027376538/,0,3,476.28267,9,938.35742,2,0,8105.90581,7213.87518
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,BJTJ84718789,3360,1030038.24,2012,2,1,1,81.84,bottom floor,1,...,https://bj.lianjia.com/xiaoqu/1111027382053/,0,0,1669.04965,9,948.21718,1,0,15849.26185,21363.35490
4996,BJFT84287006,736,1300028.16,2012,1,1,1,58.12,3,1,...,https://bj.lianjia.com/xiaoqu/1111027375508/,0,1,770.89444,9,777.47225,1,1,6221.44255,6278.89885
4997,BJDC84781079,200,2190056.96,2012,2,1,1,58.24,3,1,...,https://bj.lianjia.com/xiaoqu/1111027374239/,2,1,715.50723,9,2053.92372,0,0,3779.07141,3505.28158
4998,BJSJ85075781,2133,1500007.54,2012,1,1,1,63.86,2,1,...,https://bj.lianjia.com/xiaoqu/1111027378940/,0,1,863.10517,9,413.21398,2,2,21094.05154,15576.85068


Note: we cannot write single backslashes (`\`) in an ordinary string to build a Windows file path, because Python treats a backslash as the start of an escape sequence (for example, `\n` means a new line).

- Three ways to write a Windows file path in Python:
1. Use a raw string by adding an `r` in front of the file path:
    ```python
    df = pd.read_csv(r"C:\Users\Documents\data.csv")
    ```
2. Replace each backslash (`\`) with a double backslash (`\\`):
    ```python
    df = pd.read_csv("C:\\Users\\Documents\\data.csv")
    ```
3. Replace each backslash (`\`) with a forward slash (`/`):
    ```python
    df = pd.read_csv("C:/Users/Documents/data.csv")
    ```
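To see that the three spellings above describe the same location, we can compare them directly. The following is a small sketch that uses `PureWindowsPath` from the standard-library `pathlib` module (not used elsewhere in this lab), which lets us reason about Windows paths on any operating system without touching the disk.

```python
from pathlib import PureWindowsPath

raw = r"C:\Users\Documents\data.csv"         # raw string: backslashes kept as typed
doubled = "C:\\Users\\Documents\\data.csv"   # each \\ stands for one literal backslash
forward = "C:/Users/Documents/data.csv"      # forward slashes need no escaping

# the raw and the doubled spelling produce exactly the same string
print(raw == doubled)

# as Windows paths, the backslash and forward-slash versions are equal too
print(PureWindowsPath(raw) == PureWindowsPath(forward))

# the last component of the path is the file name
print(PureWindowsPath(forward).name)
```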

In [12]:
# examples: escape character
print("hello \n \n world! ")

hello 
 
 world! 



Similarly, to export a DataFrame as an Excel file: 
   - use `df.to_excel('<file path>/data.xlsx')`, which exports a DataFrame called `df` to an Excel file saved in the designated folder.  
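A quick way to check an export is to write a file and read it straight back. The sketch below does this with `to_csv` and `read_csv` in a temporary folder, so that it runs without an Excel engine installed; the table contents are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# made-up records, only for illustration
df_small = pd.DataFrame({
    "HouseID": ["A001", "A002", "A003"],
    "Bedroom": [2, 3, 1],
})

with tempfile.TemporaryDirectory() as folder:
    out_path = os.path.join(folder, "data.csv")

    # index=False leaves the row labels (0, 1, 2) out of the file
    df_small.to_csv(out_path, index=False)

    # read the file back to confirm what was written
    df_back = pd.read_csv(out_path)

print(df_back)
```

`df.to_excel(path, index=False)` follows the same pattern, assuming an Excel writer engine such as openpyxl is installed. Without `index=False`, the row labels are written to the file as an extra first column.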

<p><a id="exploringourdata"> </a></p>
<h3 id="exploringourdata">2.3 <a href="#table">Exploring our data</a><a class="anchor-link" href="#exploring our data">¶</a></h3>

You can see here that (in addition to the formatting of tabular data in Jupyter) the main difference between this DataFrame and a series is that we have multiple columns with column labels. So the dataframe structure consists of: 

- An index, with index labels (here the labels are just `0`,`1`,...,`4999`)
- Columns, with column labels (e.g., `HouseID`, `CommunityID`,etc.)
- And the data, which are the values in each row. 

Now let us do some exploration.

- To check the top five rows, use **`df.head(5)`**

In [15]:
# check the top five rows: 
df_2012.head(5)

Unnamed: 0,HouseID,CommunityID,TotalPrice,TransYear,Bedroom,Livingroom,Bathroom,Size,FloorLevel,WinSouth,...,XIAOQUWEB,SchQuality,NumSubway1km,Dist2Subway,HospQuality,Dist2Hosp,NumHosp1km,NumBus200m,Dist2CBD,Dist2Center
0,BJFT84326414,1544,1400010.56,2012,2,1,1,69.68,bottom floor,0,...,https://bj.lianjia.com/xiaoqu/1111027377493/,0,2,633.24007,9,1803.02071,0,1,9345.20091,7396.31505
1,BJCP84958845,2606,1800066.0,2012,3,2,2,129.0,,0,...,https://bj.lianjia.com/xiaoqu/1111027380050/,0,0,2284.0939,9,9154.80958,0,0,18298.50637,18632.22305
2,BJDX84905788,2264,1350038.34,2012,2,1,1,88.83,3,1,...,https://bj.lianjia.com/xiaoqu/1111027379274/,0,1,667.21572,8,11158.05983,0,4,22480.82065,20105.0677
3,BJFT00386624,3621,1800006.91,2012,2,1,1,98.69,3,0,...,https://bj.lianjia.com/xiaoqu/1111027382765/,0,1,939.29061,9,1698.79101,0,10,16309.85203,11427.48851
4,BJCY84713854,1127,1970019.58,2012,1,1,1,53.66,4,0,...,https://bj.lianjia.com/xiaoqu/1111027376538/,0,3,476.28267,9,938.35742,2,0,8105.90581,7213.87518


In [16]:
# check the last five rows
df_2012.tail(5)

Unnamed: 0,HouseID,CommunityID,TotalPrice,TransYear,Bedroom,Livingroom,Bathroom,Size,FloorLevel,WinSouth,...,XIAOQUWEB,SchQuality,NumSubway1km,Dist2Subway,HospQuality,Dist2Hosp,NumHosp1km,NumBus200m,Dist2CBD,Dist2Center
4995,BJTJ84718789,3360,1030038.24,2012,2,1,1,81.84,bottom floor,1,...,https://bj.lianjia.com/xiaoqu/1111027382053/,0,0,1669.04965,9,948.21718,1,0,15849.26185,21363.3549
4996,BJFT84287006,736,1300028.16,2012,1,1,1,58.12,3,1,...,https://bj.lianjia.com/xiaoqu/1111027375508/,0,1,770.89444,9,777.47225,1,1,6221.44255,6278.89885
4997,BJDC84781079,200,2190056.96,2012,2,1,1,58.24,3,1,...,https://bj.lianjia.com/xiaoqu/1111027374239/,2,1,715.50723,9,2053.92372,0,0,3779.07141,3505.28158
4998,BJSJ85075781,2133,1500007.54,2012,1,1,1,63.86,2,1,...,https://bj.lianjia.com/xiaoqu/1111027378940/,0,1,863.10517,9,413.21398,2,2,21094.05154,15576.85068
4999,BJCY84949599,430,1840034.6,2012,2,1,1,55.19,top floor,1,...,https://bj.lianjia.com/xiaoqu/1111027374756/,0,2,543.00065,7,522.73335,1,2,6672.24055,7184.44822


- To explore the shape (dimension) of the DataFrame, use **`df.shape`**

In [17]:
df_2012.shape

(5000, 30)

- To explore the column names, use **`df.columns`**. Note that `df.columns` returns a pandas `Index` object rather than a list. 

In [20]:
# check column names
df_2012.columns

Index(['HouseID', 'CommunityID', 'TotalPrice', 'TransYear', 'Bedroom',
       'Livingroom', 'Bathroom', 'Size', 'FloorLevel', 'WinSouth',
       'WinSouthNorth', 'Decoration', 'TotalFloor', 'BuiltYear', 'Elevation',
       'Heating', 'TransMonth', 'TransDay', 'District', 'CensusTract',
       'XIAOQUWEB', 'SchQuality', 'NumSubway1km', 'Dist2Subway', 'HospQuality',
       'Dist2Hosp', 'NumHosp1km', 'NumBus200m', 'Dist2CBD', 'Dist2Center'],
      dtype='object')

In [21]:
# Get the index of the variable "Bedroom"
list(df_2012.columns).index("Bedroom")

4
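Converting the `Index` to a list works, but the `Index` object can also answer the question directly. Here is a minimal sketch on a made-up three-column table that reuses some of the column names from this lab.

```python
import pandas as pd

# made-up table, only for illustration
df_demo = pd.DataFrame({
    "HouseID": ["A001", "A002"],
    "TotalPrice": [1400000.0, 1800000.0],
    "Bedroom": [2, 3],
})

# turn the Index object into a plain Python list of column names
print(df_demo.columns.tolist())

# position of one column, counted from 0, without converting to a list first
print(df_demo.columns.get_loc("Bedroom"))
```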

- To get basic descriptive statistics for the numeric columns of our DataFrame, use **`df.describe()`**

In [23]:
# descriptive statistics
df_2012.describe()

Unnamed: 0,CommunityID,TotalPrice,TransYear,Bedroom,Livingroom,Bathroom,Size,WinSouth,WinSouthNorth,Decoration,...,CensusTract,SchQuality,NumSubway1km,Dist2Subway,HospQuality,Dist2Hosp,NumHosp1km,NumBus200m,Dist2CBD,Dist2Center
count,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,...,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0
mean,1896.095,1947660.0,2012.0,1.9286,1.2076,1.1526,80.3066,0.7644,0.4682,0.0,...,119.4342,0.4338,0.8044,1183.742821,8.6344,3202.480164,0.2458,3.1464,13449.129254,13063.810653
std,1033.137016,932169.0,0.0,0.694837,0.505721,0.367342,29.529724,0.424416,0.499038,0.0,...,63.883396,1.020895,0.834674,1036.077987,0.684127,3374.224481,0.541702,4.734242,6524.513313,6682.127691
min,4.0,429000.0,2012.0,1.0,0.0,1.0,24.4,0.0,0.0,0.0,...,2.0,0.0,0.0,42.98615,7.0,53.19033,0.0,0.0,990.31903,968.06103
25%,981.75,1325027.0,2012.0,1.0,1.0,1.0,58.4575,1.0,0.0,0.0,...,64.0,0.0,0.0,567.02409,9.0,1164.41655,0.0,0.0,8827.19323,7543.08136
50%,1968.5,1750016.0,2012.0,2.0,1.0,1.0,73.39,1.0,0.0,0.0,...,118.0,0.0,1.0,867.04662,9.0,1995.61647,0.0,1.0,12725.224445,11571.99157
75%,2752.0,2300078.0,2012.0,2.0,2.0,1.0,95.8525,1.0,1.0,0.0,...,177.0,0.0,1.0,1396.076155,9.0,3799.90208,0.0,5.0,18418.616755,17531.304565
max,3718.0,13300120.0,2012.0,4.0,2.0,3.0,292.93,1.0,1.0,0.0,...,226.0,4.0,4.0,8107.92547,9.0,23348.07469,4.0,42.0,37268.65097,40228.30348


- `.sort_values()` sorts your DataFrame by a given column. If the column is numeric, it sorts the values from smallest to largest. If the column contains strings, it sorts them alphabetically.
    - The index labels move together with their rows: each row keeps its original index label, so after sorting the labels are no longer in sequential order.
    - By default, `df.sort_values()` does not modify the DataFrame in place. Instead, it returns a new DataFrame with the rows sorted. To apply the sorting directly to the original DataFrame, use the `inplace=True` parameter.
      - `inplace=True`: modifies the original DataFrame directly and returns `None`.
      - `inplace=False` (the default): returns a new sorted DataFrame, leaving the original DataFrame unchanged.
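The points above can be checked on a tiny table. The following sketch uses made-up prices and shows that the index labels travel with their rows, that the original stays untouched by default, and what `inplace=True` changes.

```python
import pandas as pd

# made-up records, only for illustration
df_demo = pd.DataFrame({
    "HouseID": ["A001", "A002", "A003"],
    "TotalPrice": [1800000, 429000, 1350000],
})

# default: returns a new, sorted DataFrame
sorted_df = df_demo.sort_values("TotalPrice")
print(sorted_df.index.tolist())        # index labels moved along with their rows

# the original DataFrame still has its rows in the old order
original_order = df_demo.index.tolist()
print(original_order)

# optional: renumber the sorted rows from 0 and discard the old labels
renumbered = sorted_df.reset_index(drop=True)

# inplace=True: the original DataFrame itself is reordered
df_demo.sort_values("TotalPrice", ascending=False, inplace=True)
print(df_demo)
```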

In [26]:
# sort from the smallest to the largest
df_2012.sort_values("TotalPrice")

Unnamed: 0,HouseID,CommunityID,TotalPrice,TransYear,Bedroom,Livingroom,Bathroom,Size,FloorLevel,WinSouth,...,XIAOQUWEB,SchQuality,NumSubway1km,Dist2Subway,HospQuality,Dist2Hosp,NumHosp1km,NumBus200m,Dist2CBD,Dist2Center
4572,BJHD84586553,2268,429000.00,2012,2,1,1,52.80,3,1,...,https://bj.lianjia.com/xiaoqu/1111027379278/,1,2,494.93863,9,1018.48756,0,0,12814.51532,7597.30985
4360,BJDX84234421,1792,465041.19,2012,1,2,1,67.29,3,1,...,https://bj.lianjia.com/xiaoqu/1111027378083/,0,0,5480.12539,8,22013.62655,0,1,34020.54729,32407.04761
668,BJSY84820479,3065,500024.60,2012,1,0,1,32.66,3,1,...,https://bj.lianjia.com/xiaoqu/1111027381324/,0,1,160.09919,9,15515.11968,0,11,29219.05096,32530.22463
1832,BJCY84803356,2215,520032.42,2012,1,1,1,36.78,top floor,1,...,https://bj.lianjia.com/xiaoqu/1111027379159/,0,0,1712.15277,9,6616.04165,0,2,19522.51032,23249.49135
4833,BJCY84404564,2215,550017.70,2012,1,1,1,33.70,top floor,1,...,https://bj.lianjia.com/xiaoqu/1111027379159/,0,0,1712.15277,9,6616.04165,0,2,19522.51032,23249.49135
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,BJXC84867016,3206,7916136.18,2012,4,1,2,147.34,3,1,...,https://bj.lianjia.com/xiaoqu/1111027381668/,4,1,888.84356,9,1375.33226,0,1,8903.54161,6298.83422
1437,BJHD84925549,3518,8250036.00,2012,3,2,2,157.00,2,1,...,https://bj.lianjia.com/xiaoqu/1111027382487/,2,0,1297.85840,9,896.61896,1,3,15310.60488,9820.69326
738,BJHD84822957,376,8540189.46,2012,3,2,2,190.01,3,1,...,https://bj.lianjia.com/xiaoqu/1111027374627/,2,1,495.85158,9,1337.76996,0,0,15068.05800,9903.45366
1113,BJCY84196120,3470,9500207.97,2012,4,2,3,229.79,2,1,...,https://bj.lianjia.com/xiaoqu/1111027382344/,0,1,331.75223,7,1546.11645,0,0,6198.60201,7944.01154


In [27]:
# sort from the largest to the smallest
df_2012.sort_values("TotalPrice", ascending = False)

Unnamed: 0,HouseID,CommunityID,TotalPrice,TransYear,Bedroom,Livingroom,Bathroom,Size,FloorLevel,WinSouth,...,XIAOQUWEB,SchQuality,NumSubway1km,Dist2Subway,HospQuality,Dist2Hosp,NumHosp1km,NumBus200m,Dist2CBD,Dist2Center
3645,BJHD84357872,2732,13300123.74,2012,4,2,3,249.22,3,1,...,https://bj.lianjia.com/xiaoqu/1111027380422/,2,2,498.20211,9,2533.70859,0,7,15680.12935,11144.58343
1113,BJCY84196120,3470,9500207.97,2012,4,2,3,229.79,2,1,...,https://bj.lianjia.com/xiaoqu/1111027382344/,0,1,331.75223,7,1546.11645,0,0,6198.60201,7944.01154
738,BJHD84822957,376,8540189.46,2012,3,2,2,190.01,3,1,...,https://bj.lianjia.com/xiaoqu/1111027374627/,2,1,495.85158,9,1337.76996,0,0,15068.05800,9903.45366
1437,BJHD84925549,3518,8250036.00,2012,3,2,2,157.00,2,1,...,https://bj.lianjia.com/xiaoqu/1111027382487/,2,0,1297.85840,9,896.61896,1,3,15310.60488,9820.69326
886,BJXC84867016,3206,7916136.18,2012,4,1,2,147.34,3,1,...,https://bj.lianjia.com/xiaoqu/1111027381668/,4,1,888.84356,9,1375.33226,0,1,8903.54161,6298.83422
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4833,BJCY84404564,2215,550017.70,2012,1,1,1,33.70,top floor,1,...,https://bj.lianjia.com/xiaoqu/1111027379159/,0,0,1712.15277,9,6616.04165,0,2,19522.51032,23249.49135
1832,BJCY84803356,2215,520032.42,2012,1,1,1,36.78,top floor,1,...,https://bj.lianjia.com/xiaoqu/1111027379159/,0,0,1712.15277,9,6616.04165,0,2,19522.51032,23249.49135
668,BJSY84820479,3065,500024.60,2012,1,0,1,32.66,3,1,...,https://bj.lianjia.com/xiaoqu/1111027381324/,0,1,160.09919,9,15515.11968,0,11,29219.05096,32530.22463
4360,BJDX84234421,1792,465041.19,2012,1,2,1,67.29,3,1,...,https://bj.lianjia.com/xiaoqu/1111027378083/,0,0,5480.12539,8,22013.62655,0,1,34020.54729,32407.04761


<p><a id="indexingandslicingofdataFrame"> </a></p>
<h3 id="indexingandslicingofdataFrame">2.4 <a href="#table">Indexing and slicing DataFrame</a><a class="anchor-link" href="#indexingandslicingofdataFrame">¶</a></h3>

We can select a subset of a DataFrame using **indexing** and **slicing**. 
- *Indexing* means selecting a particular row or column from a DataFrame. 
- *Slicing* means selecting a range of rows or columns.

There are three ways of selecting particular rows and columns of a DataFrame object: `df[]`, `df.loc[rows_label , columns_label]` and `df.iloc[row_position , column_position]`. The difference between `.loc` and `.iloc` is whether rows and columns are referred to by **label** or by **position**:
- A *label*: a name in the list of column names, or a value in the row index (the column at the far left). 
- A *position*: the integer position of a column name or index value in its sequence, counting from zero.

**1. Selecting elements based on `df[]`**: 
- `df["col_name_1"]` selects the column named "col_name_1" and returns a Series.
- `df[["col_name_1", "col_name_5"]]` selects multiple columns together and returns a DataFrame.

In [33]:
# select HouseID based on column: 
df_2012["HouseID"]


0       BJFT84326414 
1       BJCP84958845 
2       BJDX84905788 
3       BJFT00386624 
4       BJCY84713854 
            ...      
4995    BJTJ84718789 
4996    BJFT84287006 
4997    BJDC84781079 
4998    BJSJ85075781 
4999    BJCY84949599 
Name: HouseID, Length: 5000, dtype: object


- Try typing `df[["col_name_1"]]`. How does the result differ from `df["col_name_1"]`?


In [35]:
type(df_2012[["HouseID"]])

pandas.core.frame.DataFrame

In [36]:
# select multiple columns, all rows
df_2012[["HouseID", "TotalPrice"]]

Unnamed: 0,HouseID,TotalPrice
0,BJFT84326414,1400010.56
1,BJCP84958845,1800066.00
2,BJDX84905788,1350038.34
3,BJFT00386624,1800006.91
4,BJCY84713854,1970019.58
...,...,...
4995,BJTJ84718789,1030038.24
4996,BJFT84287006,1300028.16
4997,BJDC84781079,2190056.96
4998,BJSJ85075781,1500007.54


**2. Selecting elements based on `df.loc[ rows_label , columns_label ]`**:


In [42]:
# select the first 2 rows and all columns (with .loc, the end label 1 is included)
df_2012.loc[0:1, :]

Unnamed: 0,HouseID,CommunityID,TotalPrice,TransYear,Bedroom,Livingroom,Bathroom,Size,FloorLevel,WinSouth,...,XIAOQUWEB,SchQuality,NumSubway1km,Dist2Subway,HospQuality,Dist2Hosp,NumHosp1km,NumBus200m,Dist2CBD,Dist2Center
0,BJFT84326414,1544,1400010.56,2012,2,1,1,69.68,bottom floor,0,...,https://bj.lianjia.com/xiaoqu/1111027377493/,0,2,633.24007,9,1803.02071,0,1,9345.20091,7396.31505
1,BJCP84958845,2606,1800066.0,2012,3,2,2,129.0,,0,...,https://bj.lianjia.com/xiaoqu/1111027380050/,0,0,2284.0939,9,9154.80958,0,0,18298.50637,18632.22305


In [47]:
# select the first 3 rows and column "CommunityID" on a dataframe, and returns a dataframe
df_2012.loc[ 0:2    ,  ["CommunityID"]  ]

Unnamed: 0,CommunityID
0,1544
1,2606
2,2264


In [49]:
# select the first 3 rows, and the columns "CommunityID" and "TotalPrice"
df_2012.loc[  :2 ,   ["CommunityID", "TotalPrice"]]

Unnamed: 0,CommunityID,TotalPrice
0,1544,1400010.56
1,2606,1800066.0
2,2264,1350038.34


In [59]:
# select the rows with labels 3 and 5, and the columns "HouseID" and "TotalPrice"
df_2012.loc[ [3, 5] , [ "HouseID", "TotalPrice" ]  ]

Unnamed: 0,HouseID,TotalPrice
3,BJFT00386624,1800006.91
5,BJFT85228189,1280012.6


**3. Selecting elements based on `df.iloc[ row_position , column_position ]`:**

In [52]:
# select first 3 rows and the second column
df_2012.iloc[ 0:3  , [1]  ]

Unnamed: 0,CommunityID
0,1544
1,2606
2,2264


In [55]:
# select first 3 rows and columns "HouseID", "TotalPrice": list().index()
df_2012.iloc[ 0:3  ,  [list(df_2012.columns).index("HouseID") , list(df_2012.columns).index("TotalPrice") ]  ]

Unnamed: 0,HouseID,TotalPrice
0,BJFT84326414,1400010.56
1,BJCP84958845,1800066.0
2,BJDX84905788,1350038.34


In [61]:
# select all rows and the last 2 columns "Dist2CBD" and "Dist2Center"
df_2012.iloc[ :  ,  -2:   ]

Unnamed: 0,Dist2CBD,Dist2Center
0,9345.20091,7396.31505
1,18298.50637,18632.22305
2,22480.82065,20105.06770
3,16309.85203,11427.48851
4,8105.90581,7213.87518
...,...,...
4995,15849.26185,21363.35490
4996,6221.44255,6278.89885
4997,3779.07141,3505.28158
4998,21094.05154,15576.85068
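One detail in the cells above is easy to miss: a slice in `.loc` includes its end label, while a slice in `.iloc` excludes its end position, just like ordinary Python slicing. The sketch below shows this on a made-up four-row table, and also what happens once labels and positions stop coinciding.

```python
import pandas as pd

# made-up records, only for illustration
df_demo = pd.DataFrame({
    "HouseID": ["A001", "A002", "A003", "A004"],
    "Size": [60.0, 85.5, 120.0, 45.0],
})

# label-based: rows labelled 0, 1 and 2 (the end label is included)
by_label = df_demo.loc[0:2, ["HouseID"]]

# position-based: rows at positions 0 and 1 (the end position is excluded)
by_position = df_demo.iloc[0:2, [0]]

print(len(by_label), len(by_position))

# after sorting, labels and positions no longer point at the same rows
shuffled = df_demo.sort_values("Size")
print(shuffled.loc[0, "HouseID"])   # the row whose label is 0
print(shuffled.iloc[0, 0])          # the row that now sits in the first position
```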


<p><a id="filtering"> </a></p>
<h3 id="filtering">2.5 <a href="#table">Filtering</a><a class="anchor-link" href="#filtering">¶</a></h3>

Let us filter the DataFrame based on specified conditions. Filtering in pandas is done with a boolean expression. The expression is evaluated for each row in the dataframe, and only rows where the expression evaluates to True are returned.

Let us say we want to select those housing samples located within 1500m of the subway stations. Those units are usually marked as "subway housing" for which buyers are willing to pay more.

In [68]:
# select housing samples located within 1500m of the subway stations
df_subway = df_2012[df_2012["Dist2Subway"] <=1500]
df_subway

# You can always leave out the column argument in either .loc or .iloc in order to select all columns.

Unnamed: 0,HouseID,CommunityID,TotalPrice,TransYear,Bedroom,Livingroom,Bathroom,Size,FloorLevel,WinSouth,...,XIAOQUWEB,SchQuality,NumSubway1km,Dist2Subway,HospQuality,Dist2Hosp,NumHosp1km,NumBus200m,Dist2CBD,Dist2Center
0,BJFT84326414,1544,1400010.56,2012,2,1,1,69.68,bottom floor,0,...,https://bj.lianjia.com/xiaoqu/1111027377493/,0,2,633.24007,9,1803.02071,0,1,9345.20091,7396.31505
2,BJDX84905788,2264,1350038.34,2012,2,1,1,88.83,3,1,...,https://bj.lianjia.com/xiaoqu/1111027379274/,0,1,667.21572,8,11158.05983,0,4,22480.82065,20105.06770
3,BJFT00386624,3621,1800006.91,2012,2,1,1,98.69,3,0,...,https://bj.lianjia.com/xiaoqu/1111027382765/,0,1,939.29061,9,1698.79101,0,10,16309.85203,11427.48851
4,BJCY84713854,1127,1970019.58,2012,1,1,1,53.66,4,0,...,https://bj.lianjia.com/xiaoqu/1111027376538/,0,3,476.28267,9,938.35742,2,0,8105.90581,7213.87518
6,BJCY84112518,2767,2730093.60,2012,3,2,2,132.08,2,1,...,https://bj.lianjia.com/xiaoqu/1111027380518/,0,1,692.53062,9,5122.56097,0,0,11524.68492,16967.87510
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4994,BJCY84422499,109,1940059.80,2012,2,1,1,67.41,3,1,...,https://bj.lianjia.com/xiaoqu/1111027374001/,0,0,1083.57589,9,1131.66080,0,30,9079.72181,10821.04163
4996,BJFT84287006,736,1300028.16,2012,1,1,1,58.12,3,1,...,https://bj.lianjia.com/xiaoqu/1111027375508/,0,1,770.89444,9,777.47225,1,1,6221.44255,6278.89885
4997,BJDC84781079,200,2190056.96,2012,2,1,1,58.24,3,1,...,https://bj.lianjia.com/xiaoqu/1111027374239/,2,1,715.50723,9,2053.92372,0,0,3779.07141,3505.28158
4998,BJSJ85075781,2133,1500007.54,2012,1,1,1,63.86,2,1,...,https://bj.lianjia.com/xiaoqu/1111027378940/,0,1,863.10517,9,413.21398,2,2,21094.05154,15576.85068


- The filter above can equivalently be written as `df.loc[df["Dist2Subway"] <= 1500, : ]`. It works in three steps:
  - step 1: `df["Dist2Subway"] <= 1500` returns a Series of ***True*** or ***False*** values (boolean type), one per row; using such a Series to select rows is known as **boolean indexing**
  - step 2: passing this Series to `df[]` or `df.loc[]` returns only the rows where the value is ***True***
  - step 3: the returned DataFrame is assigned to a new variable called `df_subway`

In order to filter by more than one condition, you must:
- Wrap each condition in its own `()`
- Separate the conditions by: 
    - `|` for an `OR` condition     
    - `&` for an `AND` condition
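The rules above can be tried out on a small made-up table. The parentheses matter because `&` and `|` bind more tightly than comparisons such as `<=`. Storing each condition in its own variable, as in this sketch, is another way to keep them separate. The `~` operator (logical NOT) is not used elsewhere in this lab and is included here only for completeness.

```python
import pandas as pd

# made-up records, only for illustration
df_demo = pd.DataFrame({
    "Dist2Subway": [300.0, 1200.0, 2500.0, 800.0],
    "Bathroom": [1, 2, 2, 1],
})

# each condition is a boolean Series with one True/False per row
near = df_demo["Dist2Subway"] <= 1500
two_bath = df_demo["Bathroom"] >= 2

both = df_demo[near & two_bath]     # AND: both conditions hold
either = df_demo[near | two_bath]   # OR: at least one condition holds
not_near = df_demo[~near]           # NOT: the condition does not hold

print(both)
print(either)
print(not_near)
```

Python's own `and` and `or` keywords cannot be used here, because they expect a single True/False value rather than one value per row.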

In [84]:
# houses within 1500m of subway stations with at least two Bathrooms

df_2012[(df_2012["Dist2Subway"] <= 1500) & (df_2012["Bathroom"] >= 2)]

Unnamed: 0,HouseID,CommunityID,TotalPrice,TransYear,Bedroom,Livingroom,Bathroom,Size,FloorLevel,WinSouth,...,XIAOQUWEB,SchQuality,NumSubway1km,Dist2Subway,HospQuality,Dist2Hosp,NumHosp1km,NumBus200m,Dist2CBD,Dist2Center
6,BJCY84112518,2767,2730093.60,2012,3,2,2,132.08,2,1,...,https://bj.lianjia.com/xiaoqu/1111027380518/,0,1,692.53062,9,5122.56097,0,0,11524.68492,16967.87510
16,BJHD84735168,73,4800000.00,2012,3,2,2,162.31,2,1,...,https://bj.lianjia.com/xiaoqu/1111027373903/,1,2,780.62340,9,2388.29366,0,2,12931.55797,8841.02198
20,BJCP84427284,2624,1945008.00,2012,3,1,2,124.68,top floor,1,...,https://bj.lianjia.com/xiaoqu/1111027380088/,0,0,1353.49228,9,2995.00650,0,4,22156.56301,20492.28868
24,BJCY84206540,219,5690099.85,2012,3,2,2,132.59,4,1,...,https://bj.lianjia.com/xiaoqu/1111027374273/,3,1,584.59806,9,2425.10859,0,0,1150.59048,5314.35936
26,BJDX84708873,3293,2350086.96,2012,3,2,2,131.82,3,1,...,https://bj.lianjia.com/xiaoqu/1111027381865/,0,0,1183.40088,8,2576.24539,0,5,11878.33320,12565.01538
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4947,BJCY84980409,428,4550118.30,2012,2,2,2,123.46,3,1,...,https://bj.lianjia.com/xiaoqu/1111027374751/,0,1,133.56460,9,901.77684,2,0,7059.91186,7359.74676
4964,BJSY84514643,2092,1440096.45,2012,3,2,2,116.09,top floor,1,...,https://bj.lianjia.com/xiaoqu/1111027378834/,0,1,720.06897,9,15070.14248,0,0,28857.17822,32072.73480
4965,BJCY84712249,1023,2270089.20,2012,2,1,2,97.95,4,1,...,https://bj.lianjia.com/xiaoqu/1111027376256/,0,1,730.91213,9,6431.98463,0,1,14835.75290,14871.11849
4970,BJTJ00464885,1222,1690015.62,2012,3,2,2,124.66,3,1,...,https://bj.lianjia.com/xiaoqu/1111027376743/,0,2,584.72264,9,3799.90208,0,1,19441.12723,24831.94635


Note: 
- Here, you cannot pass the condition to `.iloc`, because `.iloc` selects by integer position and does not accept a boolean Series as a mask. Boolean indexing (e.g., `(df_2012["Dist2Subway"] <= 1500) & (df_2012["Bathroom"] >= 2)`) filters rows based on conditions, and it works with `df.loc[]` or directly within square brackets `df[]`.
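If a filter has to go through `.iloc` for some reason, one workaround is to convert the boolean Series to a plain NumPy array first, which drops its index labels. This is a sketch on made-up data rather than something the lab requires.

```python
import pandas as pd

# made-up records, only for illustration
df_demo = pd.DataFrame({
    "Dist2Subway": [300.0, 1200.0, 2500.0],
    "Bathroom": [1, 2, 2],
})

mask = (df_demo["Dist2Subway"] <= 1500) & (df_demo["Bathroom"] >= 2)

with_brackets = df_demo[mask]          # boolean Series inside df[]
with_loc = df_demo.loc[mask, :]        # boolean Series inside df.loc[]

# .iloc works from positions, so the mask is converted to a NumPy array first
with_iloc = df_demo.iloc[mask.to_numpy(), :]

print(with_iloc)
```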

<p><a id="modifyingelements"> </a></p>
<h3 id="modifyingelements">2.6 <a href="#table">Modifying elements</a><a class="anchor-link" href="#modifyingelements">¶</a></h3>

Essentially, in the same way that we can select elements we can also update them using the same logic.

In [78]:
# change the total price of a record: 4800155.94 -> 4800000
df_2012.loc[ df_2012["TotalPrice"] == 4800155.94 ,  "TotalPrice" ] = 4800000


Now, let us consider a more comprehensive example. Suppose I am interested in comparing the average housing prices of houses located at different distances from the nearest subway station.

To achieve this goal, we first create a categorical variable denoting subway accessibility. We define a categorical variable `Sublevel` that groups the continuous variable `Dist2Subway` (distance to subway) into 3 levels: 0-500m, 500-1500m, and beyond 1500m.

In [99]:
# note: we create a new variable, Sublevel, rather than overwriting Dist2Subway
# (it is not recommended to directly modify the original elements)
# start by labelling every house "Level 1" (0-500m)
df_2012["Sublevel"] = "Level 1"
# relabel houses more than 500m and up to 1500m from the subway as "Level 2"
df_2012.loc[(df_2012["Dist2Subway"] > 500) & (df_2012["Dist2Subway"] <= 1500) , "Sublevel"] = "Level 2"
# relabel houses more than 1500m from the subway as "Level 3"
df_2012.loc[(df_2012["Dist2Subway"] > 1500) , "Sublevel"] = "Level 3"
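Binning a continuous variable like this is common enough that pandas offers a dedicated function, `pd.cut`. The sketch below applies both approaches to a few made-up distances so the results can be compared; it is an alternative for reference, not the method used in the rest of this lab. The column name `Sublevel_cut` is made up for this illustration.

```python
import numpy as np
import pandas as pd

# made-up distances, including the boundary values 500 and 1500
df_demo = pd.DataFrame({"Dist2Subway": [120.0, 500.0, 800.0, 1500.0, 3200.0]})

# approach used in the lab: start with a default label, then overwrite with .loc
df_demo["Sublevel"] = "Level 1"
df_demo.loc[(df_demo["Dist2Subway"] > 500) & (df_demo["Dist2Subway"] <= 1500), "Sublevel"] = "Level 2"
df_demo.loc[df_demo["Dist2Subway"] > 1500, "Sublevel"] = "Level 3"

# pd.cut: each bin includes its right edge by default,
# so the bins are (-inf, 500], (500, 1500] and (1500, inf)
df_demo["Sublevel_cut"] = pd.cut(
    df_demo["Dist2Subway"],
    bins=[-np.inf, 500, 1500, np.inf],
    labels=["Level 1", "Level 2", "Level 3"],
)

print(df_demo)
```

Note that `pd.cut` returns a categorical column rather than plain strings, which keeps the levels in a fixed order.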

<p><a id="pandasarithmetics"> </a></p>
<h3 id="pandasarithmetics">2.7 <a href="#table">Pandas Arithmetics</a><a class="anchor-link" href="#pandasarithmetics">¶</a></h3>


**1. Calculate the unit housing price** 
So far we only have `TotalPrice`, the total amount paid for a house. However, a price normalized by size (the unit price) is easier to compare across houses, so let us calculate it.

In [89]:
df_2012.columns

Index(['HouseID', 'CommunityID', 'TotalPrice', 'TransYear', 'Bedroom',
       'Livingroom', 'Bathroom', 'Size', 'FloorLevel', 'WinSouth',
       'WinSouthNorth', 'Decoration', 'TotalFloor', 'BuiltYear', 'Elevation',
       'Heating', 'TransMonth', 'TransDay', 'District', 'CensusTract',
       'XIAOQUWEB', 'SchQuality', 'NumSubway1km', 'Dist2Subway', 'HospQuality',
       'Dist2Hosp', 'NumHosp1km', 'NumBus200m', 'Dist2CBD', 'Dist2Center',
       'Sublevel'],
      dtype='object')

In [100]:
# UnitPrice = TotalPrice/Size
df_2012["UnitPrice"] = df_2012["TotalPrice"]/df_2012["Size"]
df_2012.sort_values("UnitPrice", ascending = False).UnitPrice
# Pandas arithmetics are vectorized, meaning that the operation is applied to each element in the column

2437    66845.0
2186    63254.0
1381    61112.0
4944    60247.0
3827    59458.0
         ...   
3599     8393.0
4572     8125.0
865      7856.0
1762     7112.0
4360     6911.0
Name: UnitPrice, Length: 5000, dtype: float64
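
To make "vectorized" concrete, the sketch below compares column arithmetic with an explicit Python loop on a small made-up DataFrame (`toy`, `loop_result`, and `PriceInMillions` are hypothetical names used only for this illustration).

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "TotalPrice": [3000000.0, 4500000.0],
    "Size": [60.0, 90.0],
})

# Vectorized: the division is applied row by row without writing a loop
toy["UnitPrice"] = toy["TotalPrice"] / toy["Size"]

# The equivalent explicit loop gives the same numbers
loop_result = [price / size for price, size in zip(toy["TotalPrice"], toy["Size"])]

# A scalar is applied to every element of the column
toy["PriceInMillions"] = toy["TotalPrice"] / 1_000_000

print(toy)
```

The vectorized form is shorter and, on large columns, usually much faster than looping in Python.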

<p><a id="groupby"> </a></p>
<h3 id="groupby">2.8 <a href="#table">Groupby function</a><a class="anchor-link" href="#groupby">¶</a></h3>


Now we want to group the houses based on `Sublevel`, and calculate the average housing price for each group. 

- `DataFrame.groupby()` (called here as `df_2012.groupby()`)
    - similar to a pivot table in Excel
    - involves three main steps
        - Splitting: Divides the DataFrame into groups based on a column.
        - Applying: Applies a function (e.g., mean, sum) to each group.
        - Combining: Combines the results into a new table.
    - In our case
        - Split the DataFrame by Sublevel (distance categories to subway).
        - Apply a function (e.g., mean()) to calculate the average price.
        - Combine the grouped results to form a summary table.
    - Notice that `groupby()` on its own does not return a DataFrame; it returns a `DataFrameGroupBy` object.
        - we need to apply a function to it, e.g., sum(), mean(), or apply(), to get a DataFrame (or Series) back.
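
The three steps can be sketched by hand on a small made-up DataFrame and compared with the one-line `groupby` call. The names `toy`, `grouped`, `manual`, and `summary` are hypothetical and used only for this illustration.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "Sublevel": ["Level 1", "Level 2", "Level 1", "Level 2"],
    "UnitPrice": [30000.0, 20000.0, 40000.0, 10000.0],
})

# Splitting: groupby() alone returns a DataFrameGroupBy object, not a DataFrame
grouped = toy.groupby("Sublevel")
print(type(grouped))

# Applying and combining by hand: loop over the groups and store each mean
manual = {}
for level, part in grouped:
    manual[level] = part["UnitPrice"].mean()

# The same result in one line; the aggregation turns it back into a DataFrame
summary = grouped[["UnitPrice"]].mean()
print(summary)
```

Selecting with double brackets (`[["UnitPrice"]]`) keeps the result a DataFrame; single brackets would give a Series instead.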

In [119]:
# groupby applied to UnitPrice
df_new = df_2012.groupby("Sublevel")[["UnitPrice"]].mean()
df_new

| Sublevel | UnitPrice |
|---|---|
| Level 1 | 26882.939484 |
| Level 2 | 25939.01719 |
| Level 3 | 20328.317299 |


`.reset_index()` is a method that replaces the current index of a DataFrame with the default index of sequential integers (0, 1, 2, ...). By default, the old index (here `Sublevel`) is moved back into the DataFrame as a regular column.

- `drop`: If True, the current index is removed completely and is not added as a column. The default is False.
- `inplace`: If True, the DataFrame is modified in place, and the method does not return a new DataFrame. The default is False.
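
The effect of `drop` is easiest to see side by side. The sketch below builds a small made-up summary table (the name `summary` and its values are hypothetical) whose index is `Sublevel`, then resets the index both ways.

```python
import pandas as pd

# Hypothetical grouped result with Sublevel as the index
summary = pd.DataFrame(
    {"UnitPrice": [35000.0, 15000.0]},
    index=pd.Index(["Level 1", "Level 2"], name="Sublevel"),
)

# drop=False (default): the old index becomes a regular column
kept = summary.reset_index()

# drop=True: the old index is discarded
dropped = summary.reset_index(drop=True)

print(kept)
print(dropped)
```

Because `inplace` is left at its default of False here, `summary` itself is unchanged and each call returns a new DataFrame.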

In [121]:
# .reset_index(): move the Sublevel index back into a regular column
df_new.reset_index(drop = False, inplace = True)
df_new

|   | Sublevel | UnitPrice |
|---|---|---|
| 0 | Level 1 | 26882.939484 |
| 1 | Level 2 | 25939.01719 |
| 2 | Level 3 | 20328.317299 |


In [133]:
# groupby sublevel and get the average values of "UnitPrice","Dist2Subway", and "Size"
df_new = df_2012.groupby("Sublevel")[["UnitPrice","Dist2Subway", "Size"]].mean().reset_index()
df_new

|   | Sublevel | UnitPrice | Dist2Subway | Size |
|---|---|---|---|---|
| 0 | Level 1 | 26882.939484 | 355.116527 | 81.074792 |
| 1 | Level 2 | 25939.01719 | 902.736946 | 79.046465 |
| 2 | Level 3 | 20328.317299 | 2713.126174 | 82.983673 |
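
As a side note, `groupby()` also accepts `as_index=False`, which keeps the grouping column as a regular column so that a separate `.reset_index()` is not needed. The sketch below uses a made-up DataFrame (`toy`, `chained`, and `flat` are hypothetical names) and is an alternative to, not a replacement for, the chained call used in this lab.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "Sublevel": ["Level 1", "Level 2", "Level 1", "Level 2"],
    "UnitPrice": [30000.0, 20000.0, 40000.0, 10000.0],
    "Size": [60.0, 80.0, 100.0, 120.0],
})

# Approach used in the lab: aggregate, then reset the index
chained = toy.groupby("Sublevel")[["UnitPrice", "Size"]].mean().reset_index()

# Alternative: keep Sublevel as a column from the start
flat = toy.groupby("Sublevel", as_index=False)[["UnitPrice", "Size"]].mean()

print(chained)
print(flat)
```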


## in-class exercise (due Jan 30th)

1. Import and Explore a Dataset - Find a dataset of your interest, read it into a Pandas DataFrame, and explore its structure.
- Display the first few rows
- Print the variable (column) names
- Print the dimensions (number of rows and columns)

## in-class exercise (due Feb 6th)

2. Import the `HouseBeijing2012.xlsx` data.
- Select the last two rows and the last three columns using both `df.loc[]` and `df.iloc[]` methods.
- Select the first three rows and the column "TransYear" and "Dist2Subway" using both `df.loc[]` and `df.iloc[]` methods.

3. Given `str1 = "12322378910"`, return a list that stores all indices at which the character "2" appears.

4. The `UnitPrice` we calculated is measured in Chinese yuan (CNY) per square meter. (1) Convert the measurement unit to USD per square foot. (2) Show the `UnitPrice` in descending order.

5. Using the `HouseBeijing2012.xlsx` dataset, filter the dataset to find all the houses with:
- (1) A total price greater than 5 million CNY and 
- (2) A distance to the nearest subway station (Dist2Subway) less than or equal to 1000 meters.

Display the filtered dataset.

6. Group the dataset by the number of bedrooms (Bedroom) and calculate:
- The average unit price for each group.