# Basics - Indexing, Labelling and Ordering

We'll be using some data from AirBnB for this example: https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data

In [22]:
import pandas as pd

df = pd.read_csv("AB_NYC_2019.csv")
df.head(3)

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365


## Indexing

"Index" means different things depending on the context. In pandas, the index is the set of row labels shown on the left, which is used to identify each row (pandas does not require the labels to be unique, but lookups are simplest when they are). By default, the index is generated by counting up from zero. In this data, however, the `id` column (the listing's identifier in the source database, i.e. its primary key) would also be a good choice.

In [2]:
df2 = df.set_index("id")
df2.head(3)
# moves id from being a column in the DataFrame to being the index

id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365


In [4]:
# The lookup uses the index label (the id), not the row position
df2.loc[2539, "name"]

'Clean & quiet apt home by the park'
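
To make the difference between looking up by label and looking up by position concrete, here is a minimal sketch on a small made-up frame. The `listings` name and its values are illustrative assumptions, not the Airbnb data.

```python
import pandas as pd

# Hypothetical toy data standing in for the Airbnb listings
listings = pd.DataFrame(
    {
        "id": [2539, 2595, 3647],
        "name": ["Park apt", "Midtown Castle", "Harlem room"],
    }
).set_index("id")

# .loc looks rows up by index label (here, the listing id)
by_label = listings.loc[2595, "name"]

# .iloc looks rows up by integer position and ignores the labels
by_position = listings.iloc[0]["name"]

# Index labels are not required to be unique; is_unique reports whether they are
ids_are_unique = listings.index.is_unique

print(by_label, by_position, ids_are_unique)
```

Using `.loc` and `.iloc` explicitly avoids any ambiguity about whether a number such as `2539` is meant as a label or as a position.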

In [5]:
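# We'll cover grouping in much more detail in the next chapter
# Group by room type and take the mean of every numeric column
# (numeric_only=True skips text columns, which newer pandas versions refuse to average)
df3 = df.groupby("room_type").mean(numeric_only=True)
df3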

room_type,id,host_id,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
Entire home/apt,18438180.0,61755930.0,40.728649,-73.960696,211.794246,8.506907,22.842418,1.306578,10.698335,111.920304
Private room,19468930.0,72475140.0,40.729208,-73.942924,89.780973,5.3779,24.112962,1.445209,3.227717,111.203933
Shared room,23003780.0,102624100.0,40.730514,-73.943343,70.127586,6.475,16.6,1.471726,4.662931,162.000862


In [7]:
df3.reset_index()
# moves room_type from the index back into a regular column

Unnamed: 0,room_type,id,host_id,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
0,Entire home/apt,18438180.0,61755930.0,40.728649,-73.960696,211.794246,8.506907,22.842418,1.306578,10.698335,111.920304
1,Private room,19468930.0,72475140.0,40.729208,-73.942924,89.780973,5.3779,24.112962,1.445209,3.227717,111.203933
2,Shared room,23003780.0,102624100.0,40.730514,-73.943343,70.127586,6.475,16.6,1.471726,4.662931,162.000862


In [9]:
df3.reset_index(drop=True)
# drop=True discards the old index (room_type) instead of keeping it as a column

Unnamed: 0,id,host_id,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
0,18438180.0,61755930.0,40.728649,-73.960696,211.794246,8.506907,22.842418,1.306578,10.698335,111.920304
1,19468930.0,72475140.0,40.729208,-73.942924,89.780973,5.3779,24.112962,1.445209,3.227717,111.203933
2,23003780.0,102624100.0,40.730514,-73.943343,70.127586,6.475,16.6,1.471726,4.662931,162.000862
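
The two `reset_index` variants can be compared side by side on a tiny frame. This is a sketch with made-up data (the `rooms` frame and its prices are assumptions for illustration). It also shows `as_index=False`, a `groupby` option that keeps the grouping key as a column from the start.

```python
import pandas as pd

# Hypothetical toy data: three listings across two room types
rooms = pd.DataFrame(
    {
        "room_type": ["Private room", "Shared room", "Private room"],
        "price": [100, 40, 60],
    }
)

# Grouping makes room_type the index of the result
means = rooms.groupby("room_type").mean(numeric_only=True)

# reset_index() turns the index back into a column and restores 0, 1, 2, ...
kept = means.reset_index()

# drop=True throws the old index away instead of keeping it
dropped = means.reset_index(drop=True)

# as_index=False never moves room_type into the index in the first place
flat = rooms.groupby("room_type", as_index=False).mean(numeric_only=True)

print(kept)
print(dropped)
print(flat)
```

Note that `drop=True` loses the information about which row belongs to which room type, so it is mainly useful when the old index carries no meaning.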


## Sorting

I almost always call `sort_index` after setting an index. To sort the DataFrame by the values in one or more columns, I commonly use `sort_values`.

In [11]:
df3.sort_index(ascending=False)

room_type,id,host_id,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
Shared room,23003780.0,102624100.0,40.730514,-73.943343,70.127586,6.475,16.6,1.471726,4.662931,162.000862
Private room,19468930.0,72475140.0,40.729208,-73.942924,89.780973,5.3779,24.112962,1.445209,3.227717,111.203933
Entire home/apt,18438180.0,61755930.0,40.728649,-73.960696,211.794246,8.506907,22.842418,1.306578,10.698335,111.920304


In [26]:
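# sort_values returns a new, sorted DataFrame; the result is not assigned here,
# so df itself is unchanged, as head() shows
df.sort_values(["neighbourhood_group", "host_name"])
df.head(3)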

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
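
Because `sort_values` returns a new object rather than changing the one it is called on, the usual pattern is to assign the result to a name. A minimal sketch, assuming a small made-up `hosts` frame:

```python
import pandas as pd

# Hypothetical toy data
hosts = pd.DataFrame(
    {
        "borough": ["Queens", "Bronx", "Bronx"],
        "host_name": ["Zoe", "Bea", "Al"],
    }
)

# The sorted result is a new DataFrame; hosts keeps its original order
result = hosts.sort_values(["borough", "host_name"])

# The sorted rows keep their original index labels (2, 1, 0 here).
# ignore_index=True renumbers them from zero instead.
renumbered = hosts.sort_values(["borough", "host_name"], ignore_index=True)

print(hosts)
print(result)
print(renumbered)
```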


In [27]:
df.neighbourhood_group.unique()

array(['Brooklyn', 'Manhattan', 'Queens', 'Staten Island', 'Bronx'],
      dtype=object)

In [28]:
df.neighbourhood_group.value_counts()

Manhattan        21661
Brooklyn         20104
Queens            5666
Bronx             1091
Staten Island      373
Name: neighbourhood_group, dtype: int64
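
`unique` and `value_counts` order their results differently, which is where sorting and indexing meet. The following is a sketch on a made-up Series (the `boroughs` values are assumptions, not the real counts):

```python
import pandas as pd

# Hypothetical toy data
boroughs = pd.Series(["Queens", "Bronx", "Queens", "Brooklyn", "Queens", "Bronx"])

# unique() returns the distinct values in order of first appearance
distinct = boroughs.unique()

# value_counts() returns a Series indexed by value, sorted by count (largest first)
counts = boroughs.value_counts()

# sort_index() reorders those same counts alphabetically by label
by_label = counts.sort_index()

# normalize=True reports fractions of the total instead of raw counts
shares = boroughs.value_counts(normalize=True)

print(distinct)
print(counts)
print(by_label)
print(shares)
```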

In [16]:
df.sort_values(["neighbourhood_group", "host_name"], ascending=[False, True], inplace=True)
# inplace=True changes the original DataFrame (and returns None), so it is not always recommended

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
35913,28522394,The Spot,215277711,Aaron,Bronx,Van Nest,40.83988,-73.86978,Entire home/apt,300,1,0,,,1,365
32010,24991133,From home to home,91554527,Aboubakar,Bronx,Highbridge,40.83413,-73.92918,Private room,50,3,23,2019-04-07,1.64,1,188
4226,2772111,It's very warm and friendly.,14176488,Ada Azra,Bronx,Fordham,40.86705,-73.88545,Shared room,55,7,10,2018-10-13,0.16,1,365
14835,11751916,Large 2BR apt. steps from Subway,62530335,Adam,Bronx,Concourse,40.81990,-73.92810,Entire home/apt,110,1,0,,,1,0
35864,28466730,Akouaba,214888995,Adama,Bronx,Belmont,40.85267,-73.88627,Private room,50,2,51,2019-07-06,5.17,1,325
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37044,29455519,Quiet Comfty Home,171446547,Xiomara,Staten Island,Bull's Head,40.61700,-74.16485,Private room,80,1,29,2019-07-07,3.40,1,362
43210,33519749,Cozy And Stylish Modern Home,252590512,Yien,Staten Island,New Brighton,40.64098,-74.09350,Entire home/apt,249,2,8,2019-06-22,2.58,1,241
25146,20148331,"BIG-3 BDRM house, 1hr to Manhattan, near beach",7927832,Yulia,Staten Island,"Bay Terrace, Staten Island",40.55182,-74.14439,Entire home/apt,150,3,1,2018-08-16,0.09,1,0
38426,30250766,"Staten Island - Free Wifi, Parking Space, Near...",225160295,Yun,Staten Island,Rosebank,40.61438,-74.06640,Entire home/apt,138,1,51,2019-07-02,7.18,1,291
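
When `ascending` is given as a list, each entry applies to the sort key in the same position. A small sketch of that behaviour without `inplace`, using a made-up `hosts` frame (an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy data
hosts = pd.DataFrame(
    {
        "borough": ["Bronx", "Queens", "Bronx", "Queens"],
        "host_name": ["Bea", "Zoe", "Al", "Cy"],
    }
)

# borough sorted descending (Z to A), host_name ascending (A to Z) within each borough
ordered = hosts.sort_values(["borough", "host_name"], ascending=[False, True])

print(ordered)
```

Assigning the result to a new name leaves `hosts` untouched, which makes it easier to re-run cells in any order.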


## Rank

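Like sorting, but instead of reordering the rows, `rank` gives each value its position in the sorted order. The `method` argument decides which rank tied values receive.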

In [29]:
dfp = df.sort_values("price", ascending=False)  # sort by price, highest first
dfp[["id", "host_name", "price"]].head(5)  # select 3 columns and the first 5 rows

Unnamed: 0,id,host_name,price
9151,7003697,Kathrine,10000
17692,13894339,Erin,10000
29238,22436899,Jelena,10000
40433,31340283,Matt,9999
12342,9528920,Amy,9999


In [30]:
dfp["price_rank"] = dfp.price.rank(method="max", ascending=False)

In [33]:
dfp[["id", "host_name", "price", "price_rank"]].head(5)
# method="max" gives tied values the highest rank in their group:
# the three 10000 listings occupy ranks 1-3, so each is ranked 3

Unnamed: 0,id,host_name,price,price_rank
9151,7003697,Kathrine,10000,3.0
17692,13894339,Erin,10000,3.0
29238,22436899,Jelena,10000,3.0
40433,31340283,Matt,9999,6.0
12342,9528920,Amy,9999,6.0
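
To see how the choice of `method` changes the treatment of ties, the following sketch ranks a small made-up list of prices (the `prices` values are an assumption, chosen to contain ties) with four of the available methods.

```python
import pandas as pd

# Hypothetical toy prices: three tied at 10000, two tied at 9999
prices = pd.Series([10000, 10000, 10000, 9999, 9999, 500])

ranks = pd.DataFrame(
    {
        "price": prices,
        # highest rank in each tied group
        "max": prices.rank(method="max", ascending=False),
        # lowest rank in each tied group
        "min": prices.rank(method="min", ascending=False),
        # like "min", but ranks increase by 1 between groups with no gaps
        "dense": prices.rank(method="dense", ascending=False),
        # mean of the ranks the tied group occupies (the pandas default)
        "average": prices.rank(method="average", ascending=False),
    }
)

print(ranks)
```

`method="min"` matches the way sports results are usually reported (three winners all place first, the next finisher places fourth), while `method="dense"` is handy when you only care about the number of distinct price levels.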


### Recap:

* set_index
* reset_index
* sort_values
* sort_index
* unique
* value_counts
* rank

## Next up: Slicing and Filtering