# Reviewing Series

In [1]:
import pandas

One-dimensional data consists of a list or series of values, one after another.

Two-dimensional data has both rows and columns. Each row represents a single entity (like a person or a book). Each column represents a quality of that entity, like "name," "birthday," or "sales per year."

A Python list and a Pandas series are both one-dimensional data.

A spreadsheet (.xlsx or .csv) and a Pandas data frame are both two-dimensional data. A Python dictionary can also represent two-dimensional data.

Let's review how to create a series in Pandas. First, we create a Python list. Then, we "pass" (give) the list to the Pandas Series object.

With a series, you can use methods like .describe() and .mean() to get information about the data in the series.

In [2]:
numbers = [10, 10, 5, 15, 20]

pandas_series = pandas.Series(numbers)

pandas_series.describe()

count     5.000000
mean     12.000000
std       5.700877
min       5.000000
25%      10.000000
50%      10.000000
75%      15.000000
max      20.000000
dtype: float64
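
The individual summary methods work the same way as describe(). The sketch below reuses the values from the cell above under an illustrative variable name, `scores`, and calls a few of them one at a time. The results should agree with the mean, min, max, and 50% rows of the describe() output.

```python
import pandas

# Same values as the list above; the name `scores` is only illustrative.
scores = pandas.Series([10, 10, 5, 15, 20])

print(scores.mean())    # arithmetic average of the five values
print(scores.min())     # smallest value
print(scores.max())     # largest value
print(scores.median())  # middle value once the data is sorted
```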

# Creating a Data Frame

To create a data frame in Pandas, we can create a Python dictionary in which the keys represent column names and the values are lists that represent the items in the column.

Then we pass the dictionary to the Pandas data frame object.

Below, we create a dictionary, phone_book. The keys in the dictionary are "name" and "number." The values are the data in the column as Python lists.

In [19]:
phone_book = {
    'name': ['Pat', 'Fergus', 'Finnigan'],
    'number': [1111111, 4444444, 7777777]
}


df = pandas.DataFrame(phone_book)

df

Unnamed: 0,name,number
0,Pat,1111111
1,Fergus,4444444
2,Finnigan,7777777
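
To check how the dictionary maps onto the data frame, we can inspect a few of its attributes. This is a small sketch that rebuilds the same phone_book so it runs on its own. It shows that the keys became column names and that a single column is itself a one-dimensional series.

```python
import pandas

# Same structure as phone_book above: keys become column names,
# and each list becomes the values in that column.
phone_book = {
    'name': ['Pat', 'Fergus', 'Finnigan'],
    'number': [1111111, 4444444, 7777777],
}
df = pandas.DataFrame(phone_book)

print(list(df.columns))  # column names, taken from the dictionary keys
print(df.shape)          # (number of rows, number of columns)
print(type(df['name']))  # a single column is a pandas Series
```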


# Import Airbnb Data

You can also create a data frame directly from a .csv or .xlsx file. Here, I have uploaded a .csv file to a URL so that we can import it easily. In assignment #1, you will instead load the file from your hard drive or from a file uploaded to Google Colab. Some example code has been provided in the assignment file, so the code for your assignment will look slightly different.

Below, we import some data for Airbnb rentals in New York City.

In [5]:
df = pandas.read_csv('http://bit.ly/airbnbcsv')

df

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.94190,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.10,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48890,36484665,Charming one bedroom - newly renovated rowhouse,8232441,Sabrina,Brooklyn,Bedford-Stuyvesant,40.67853,-73.94995,Private room,70,2,0,,,2,9
48891,36485057,Affordable room in Bushwick/East Williamsburg,6570630,Marisol,Brooklyn,Bushwick,40.70184,-73.93317,Private room,40,4,0,,,2,36
48892,36485431,Sunny Studio at Historical Neighborhood,23492952,Ilgar & Aysel,Manhattan,Harlem,40.81475,-73.94867,Entire home/apt,115,10,0,,,1,27
48893,36485609,43rd St. Time Square-cozy single bed,30985759,Taz,Manhattan,Hell's Kitchen,40.75751,-73.99112,Shared room,55,1,0,,,6,2
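
Loading from a local file uses the same read_csv function with a file path in place of the URL. The sketch below is not the assignment's code. It writes a tiny made-up CSV to a temporary folder so that the example runs anywhere, and the file name `listings_sample.csv` and its contents are hypothetical.

```python
import os
import tempfile

import pandas

# Hypothetical sample data, written to disk only so there is a file to load.
csv_text = "name,price\nSunny room,70\nQuiet studio,115\n"

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'listings_sample.csv')  # made-up file name
    with open(path, 'w') as f:
        f.write(csv_text)

    # Same function as above, given a path on disk instead of a URL.
    local_df = pandas.read_csv(path)

print(local_df)
```

In your own work, the path would point to wherever the real file is stored.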


We can use the info() method on our data frame variable (df) to see the columns present in the data, as well as the number of rows and the type of data in each column.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              48895 non-null  int64  
 1   name                            48879 non-null  object 
 2   host_id                         48895 non-null  int64  
 3   host_name                       48874 non-null  object 
 4   neighbourhood_group             48895 non-null  object 
 5   neighbourhood                   48895 non-null  object 
 6   latitude                        48895 non-null  float64
 7   longitude                       48895 non-null  float64
 8   room_type                       48895 non-null  object 
 9   price                           48895 non-null  int64  
 10  minimum_nights                  48895 non-null  int64  
 11  number_of_reviews               48895 non-null  int64  
 12  last_review                     

To pull out a column, we can use the df.<column_name> syntax, as below:

In [7]:
df.price

0        149
1        225
2        150
3         89
4         80
        ... 
48890     70
48891     40
48892    115
48893     55
48894     90
Name: price, Length: 48895, dtype: int64

We can also use the df['<Column name>'] syntax, which still works when there are spaces in our column names (the dot syntax does not).

In [9]:
df['price']

0        149
1        225
2        150
3         89
4         80
        ... 
48890     70
48891     40
48892    115
48893     55
48894     90
Name: price, Length: 48895, dtype: int64
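
The Airbnb columns have no spaces in their names, so here is a small sketch with made-up data in which one column name does contain spaces. The `books` data frame and its values are hypothetical.

```python
import pandas

# Hypothetical data: one column name contains spaces, the other does not.
books = pandas.DataFrame({
    'title': ['Book A', 'Book B'],
    'sales per year': [1200, 850],
})

# Dot syntax works only when the column name is a valid Python identifier.
print(books.title)

# A name with spaces has to go inside brackets and quotes.
print(books['sales per year'])
```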

When we pull out a column, we can do anything to it that we could do with a series, such as describe(), min(), max(), or mean().

In [12]:
df.price.describe()

count    48895.000000
mean       152.720687
std        240.154170
min          0.000000
25%         69.000000
50%        106.000000
75%        175.000000
max      10000.000000
Name: price, dtype: float64

The mean is the average price. The median is the middlemost price when all the prices are sorted.

In [13]:
df.price.mean()

152.7206871868289

In [15]:
df.price.median()

106.0
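
The mean and the median can be far apart when a few values are extreme. In the Airbnb data the maximum price is 10,000, which may help explain why the mean (about 152.72) sits well above the median (106.0). The sketch below uses made-up prices to show the effect on a small scale.

```python
import pandas

# Made-up nightly prices; the last listing is an extreme outlier.
prices = pandas.Series([50, 60, 70, 80, 1000])

print(prices.mean())    # pulled upward by the single 1000 value
print(prices.median())  # the middle value, unaffected by the outlier
```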

We can use the "unique" method to see all the unique items in a column. Another way to think about it is that we remove duplicates.

In [16]:
df.neighbourhood.unique()

array(['Kensington', 'Midtown', 'Harlem', 'Clinton Hill', 'East Harlem',
       'Murray Hill', 'Bedford-Stuyvesant', "Hell's Kitchen",
       'Upper West Side', 'Chinatown', 'South Slope', 'West Village',
       'Williamsburg', 'Fort Greene', 'Chelsea', 'Crown Heights',
       'Park Slope', 'Windsor Terrace', 'Inwood', 'East Village',
       'Greenpoint', 'Bushwick', 'Flatbush', 'Lower East Side',
       'Prospect-Lefferts Gardens', 'Long Island City', 'Kips Bay',
       'SoHo', 'Upper East Side', 'Prospect Heights',
       'Washington Heights', 'Woodside', 'Brooklyn Heights',
       'Carroll Gardens', 'Gowanus', 'Flatlands', 'Cobble Hill',
       'Flushing', 'Boerum Hill', 'Sunnyside', 'DUMBO', 'St. George',
       'Highbridge', 'Financial District', 'Ridgewood',
       'Morningside Heights', 'Jamaica', 'Middle Village', 'NoHo',
       'Ditmars Steinway', 'Flatiron District', 'Roosevelt Island',
       'Greenwich Village', 'Little Italy', 'East Flatbush',
       'Tompkinsville', 'Asto
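
To see what unique() does on a smaller scale, here is a sketch with a handful of made-up values. It also uses nunique(), a related Pandas method that is not used elsewhere in this notebook, to count the distinct values.

```python
import pandas

# Made-up neighbourhood values with repeats.
neighbourhoods = pandas.Series(
    ['Harlem', 'Midtown', 'Harlem', 'Chelsea', 'Midtown']
)

# unique() keeps one copy of each value, in order of first appearance.
print(neighbourhoods.unique())

# nunique() counts how many distinct values there are.
print(neighbourhoods.nunique())
```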

# Indexing

Indexing is the practice of pulling out parts or subsets of the data. Essentially, you make a new data set from some portion of the data in which a certain proposition is true.

You can index by writing comparisons, such as df.price > 80 or df.neighbourhood == "Morningside Heights".

This will return a Pandas series of booleans, one for each row. For each row where the proposition is true, the boolean value will be True. For the rest of the rows, it will be False.

In [18]:
morningside_bools = df.neighbourhood == 'Morningside Heights'

morningside_bools

0        False
1        False
2        False
3        False
4        False
         ...  
48890    False
48891    False
48892    False
48893    False
48894    False
Name: neighbourhood, Length: 48895, dtype: bool

The printed output skips most of the rows in the middle, so it doesn't matter that the values you see are all False. There are True values among the rows that are not displayed.
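
One way to confirm that there are True values is to count them. The sketch below uses a made-up data frame, `listings`, that is small enough to print in full, so every boolean is visible.

```python
import pandas

# Made-up listings, small enough to print in full.
listings = pandas.DataFrame({
    'neighbourhood': ['Harlem', 'Morningside Heights', 'Midtown', 'Morningside Heights'],
    'price': [150, 95, 225, 120],
})

# Each comparison produces one True/False value per row.
is_morningside = listings.neighbourhood == 'Morningside Heights'
over_100 = listings.price > 100

print(is_morningside)
print(over_100)

# True counts as 1, so sum() gives the number of matching rows.
print(is_morningside.sum())
```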

Once we have the series of booleans, we can use it to create a new data frame consisting of only those rows in which the corresponding boolean is True.

In [20]:
morningside_df = df[morningside_bools]

morningside_df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 346 entries, 182 to 48605
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              346 non-null    int64  
 1   name                            346 non-null    object 
 2   host_id                         346 non-null    int64  
 3   host_name                       346 non-null    object 
 4   neighbourhood_group             346 non-null    object 
 5   neighbourhood                   346 non-null    object 
 6   latitude                        346 non-null    float64
 7   longitude                       346 non-null    float64
 8   room_type                       346 non-null    object 
 9   price                           346 non-null    int64  
 10  minimum_nights                  346 non-null    int64  
 11  number_of_reviews               346 non-null    int64  
 12  last_review                     

Now we can see the mean price of an Airbnb listing in Morningside Heights.

In [21]:
morningside_df.price.mean()

114.78323699421965
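
The comparison and the filtering can also be written in a single line. The sketch below does this on a made-up data frame (not the Airbnb data) so that the result is easy to check by hand.

```python
import pandas

# Made-up listings (not the Airbnb data).
listings = pandas.DataFrame({
    'neighbourhood': ['Harlem', 'Morningside Heights', 'Midtown', 'Morningside Heights'],
    'price': [150, 95, 225, 120],
})

# The comparison inside the brackets builds the True/False series,
# and the outer brackets keep only the rows where it is True.
morningside_only = listings[listings.neighbourhood == 'Morningside Heights']

print(morningside_only)               # rows keep their original labels, 1 and 3
print(len(morningside_only))          # number of rows that matched
print(morningside_only.price.mean())  # (95 + 120) / 2
```

The filtered rows keep their original row labels, which is why the info() output for morningside_df above runs from 182 to 48605 instead of starting at 0.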