Pandas is a powerful data manipulation and analysis library for Python. It provides efficient data structures and a wide range of functions and methods for working with structured data, including data transformations, aggregations, and merging datasets.

Let's start with some Pandas data structures.

## Pandas Series

A Series is a fundamental **one-dimensional** data structure provided by the Pandas library. It is a labeled array capable of holding data of various types, such as integers, floats, or strings, although a single Series normally holds one data type (its dtype). Recall that Python lists use integer-based indexing, not labels, and can freely mix types. Lists are general-purpose collections of items, while, as you will see, a Pandas Series is specifically designed for data analysis and data manipulation.
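To make the contrast with lists concrete, here is a minimal sketch. The variable names and values are illustrative assumptions, not part of the lesson's dataset. It shows that a Series reports a single dtype for all of its values, and falls back to the generic object dtype when the values are mixed.

```python
import pandas as pd

mixed_list = [500000, 'unknown', 3.5]      # a plain list can mix types freely

int_series = pd.Series([500000, 600000])   # all integers: an integer dtype
mixed_series = pd.Series(mixed_list)       # mixed values: generic object dtype

print(int_series.dtype)
print(mixed_series.dtype)
```

A numeric dtype is what lets Pandas apply fast, vectorized arithmetic to the whole Series at once.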

In [26]:
import pandas as pd

def println(s):           #println function to add an 'extra' newline
    print(s, "\n")

# Create a Series from a list
prices = [500000, 600000, 450000, 700000, 550000]   #list
prices_series = pd.Series(prices)                   #Series
print(prices_series)

# Access a single element using .loc for label-based access
print(prices_series.loc[0])

# Access a single element using .iloc for position-based indexing
println(prices_series.iloc[2])


0    500000
1    600000
2    450000
3    700000
4    550000
dtype: int64
500000
450000 



Let's review how we created the series.

The variable prices is assigned a list of integers, where each integer represents the price of a house. Later in this section, a variable addresses is assigned a list of strings, where each string represents the address of a house; each address corresponds to the price at the same position in the prices list.


The pd.Series(data, index) constructor is used to create a new Series. The data argument is where you pass the list of values you want in your series. In our case, prices is passed as the data.

The .loc indexer is used for label-based selection. This means that you use the actual labels (index names or column names) to select data.


    println(prices_series.loc[0])

The .iloc indexer is used for integer-location based indexing, allowing you to specify rows and columns by their integer position.

    println(prices_series.iloc[2])

The index argument is optional. If no index is provided, Pandas automatically creates a default integer index. It's a sequence of numbers starting from 0.
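As a small sketch of the default index (the three prices here are just sample values), you can inspect the index that Pandas generates when none is supplied:

```python
import pandas as pd

s = pd.Series([500000, 600000, 450000])   # no index argument given

print(s.index)          # the automatically generated index object
print(list(s.index))    # its labels as a plain Python list
```

Because the default labels are the integers 0, 1, 2, ..., label-based and position-based access happen to agree for such a Series.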

### Custom Indices:
We can also provide a custom index, which we will do here using addresses. This way, each value from prices is mapped to a corresponding custom label from addresses, making the data more meaningful and easier to handle.

In [29]:
# Create a Series with custom index labels
prices = [500000, 600000, 450000, 700000, 550000]
addresses = ['123 Main St', '456 Elm St', '789 Oak St', '321 Pine St', '654 Maple St']
prices_series = pd.Series(prices, index=addresses)  # create a series with a custom index label
println(prices_series)

123 Main St     500000
456 Elm St      600000
789 Oak St      450000
321 Pine St     700000
654 Maple St    550000
dtype: int64 



prices_series is now a Pandas Series object where each house price from the prices list is associated with a specific address from the addresses list. Instead of default numerical indices, you have string labels (the addresses) for indexing.

    prices_series = pd.Series(prices, index=addresses)

This is particularly useful when you want to retrieve the price of a house located at a specific address without having to know its position in the dataset.

## Label-based access

Let's try to access an element of prices_series using iloc and loc again.

iloc works, but why does loc throw an error?

In [28]:

# Access a single element using iloc
println(prices_series.iloc[4])
# Access a single element using label based indexing loc
println(prices_series.loc[0])


550000 



KeyError: 0

The .iloc indexer works because it is used for integer-location based indexing. The .loc indexer fails because we changed the index labels from integers to addresses!

In [30]:
# Fix the error. Access a single element using the correct index
println(prices_series.loc['654 Maple St'])

550000 
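The difference between labels and positions is easiest to see when the labels are integers that do not start at 0. The following sketch assumes hypothetical listing IDs (101, 102, 103) as the index; they are not part of the lesson's data.

```python
import pandas as pd

# Hypothetical listing IDs used as integer labels (they are not positions)
s = pd.Series([500000, 600000, 450000], index=[101, 102, 103])

by_label = s.loc[101]      # looks up the label 101
by_position = s.iloc[0]    # looks up the first position

print(by_label, by_position)

try:
    s.loc[0]               # fails: no element carries the label 0
except KeyError:
    print("No element is labeled 0")
```

Both lookups return the same element here, but only because label 101 happens to sit at position 0.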



## Series Index

A Series index refers to the labels assigned to each element or value in the Series. It is a unique identifier associated with each item in the Series and serves as a way to access and reference the data within the Series. The index can be used to retrieve specific values or perform operations on specific subsets of data within the Series.

## Indexing
By default, when a Series is created, it is assigned a numeric index starting from 0 and incrementing by 1 for each element. However, as we have seen, we can assign custom index labels.


In [31]:
# Accessing elements in a Series
print(prices_series)
print(prices_series[0])                # Access by position  (the first element of the Series is s[0] so note the zero indexing)
print(prices_series['123 Main St'])    # Access by index label


123 Main St     500000
456 Elm St      600000
789 Oak St      450000
321 Pine St     700000
654 Maple St    550000
dtype: int64
500000
500000



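In pandas, when you use a custom index as we have with addresses, the Series object allows access with square brackets in two ways: label-based indexing and position-based indexing. This dual access is designed for convenience and flexibility, but it can cause confusion. In pandas 2.1 and later, using an integer key such as prices_series[0] as a position on a Series with non-integer labels triggers a FutureWarning, so prefer .iloc for positions and .loc for labels.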

## Slicing
Slicing in Python, including slicing in Pandas, refers to the technique of extracting a subset of elements or rows from a data structure such as a list, array, or DataFrame. It allows you to select a range or specific elements based on their positions or labels.

In the context of Pandas, slicing is commonly used on Series and DataFrames to extract a portion of the data based on row or column positions, or based on index labels.



In [32]:
# Slicing based on positions
print(prices_series[2:4])

# Slicing a Series based on index labels
prices_series = pd.Series([10, 20, 30, 40, 50], index=['A', 'B', 'C', 'D', 'E'])
sliced_series = prices_series['B':'D']  # Extracts elements with index labels 'B', 'C', and 'D'
print(sliced_series)

789 Oak St     450000
321 Pine St    700000
dtype: int64
B    20
C    30
D    40
dtype: int64


Why did

    println(prices_series[2:4])

print just two values? In Python slicing, the start index is inclusive while the end index is exclusive.

Unlike positional slicing, label-based slicing in pandas includes the end label. Thus three values are printed for the example below.


    sliced_series = prices_series['B':'D']  # Extracts elements with index labels 'B', 'C', and 'D'
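The two slicing rules can be placed side by side with the explicit indexers. This is a minimal sketch that reuses the A to E example Series; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], index=['A', 'B', 'C', 'D', 'E'])

by_position = s.iloc[1:3]    # positions 1 and 2; the end position 3 is excluded
by_label = s.loc['B':'D']    # labels 'B' through 'D'; the end label is included

print(by_position)
print(by_label)
```

Using .iloc and .loc makes it explicit which of the two rules applies to a given slice.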


## Operations on Series

There are many basic operations we can perform on a series, including logical, arithmetic, statistical, and string manipulations.

In [33]:
import pandas as pd

println(prices_series)
println(type(prices_series))

# add $10,000 to the price
print("add $10,000 to the price:\n", prices_series + 10000)

# calculate the mean
mean_price = prices_series.mean()
print("\n calculate the mean:\n", mean_price )

# calculate the sum
sum_prices = prices_series.sum()
print("\n calculate the sum :\n", sum_prices)

# Filter
print("\n Boolean Filter :\n", prices_series > 10)                  # boolean filter
print("\n return subseries  :\n  ", prices_series[prices_series > 10])   # return subseries


# Sort
print("\n Sort by value  :\n", prices_series.sort_values())  # sort by values
print("\n Sort by index :\n", prices_series.sort_index())   # sort by index

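# Use .replace() to replace values in a Series. It matches by value, not by position:
# no element equals 2 here, so the Series comes back unchanged.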
s_replaced = prices_series.replace(2, 710000)
print("\n Replace values :\n", s_replaced)

# Let's create a cities series
cities_series = pd.Series(['New York', 'Boston', 'Chicago', 'London', 'Berlin', 'Boston'])

# Retrieve the unique values (use .nunique() for the count of unique values).
print("\n Unique values  :\n",cities_series.unique())

# String methods under .str can be used to manipulate textual data.
print("\n Set text to upper case :\n", cities_series.str.upper())


A    10
B    20
C    30
D    40
E    50
dtype: int64 

<class 'pandas.core.series.Series'> 

add $10,000 to the price:
 A    10010
B    10020
C    10030
D    10040
E    10050
dtype: int64

 calculate the mean:
 30.0

 calculate the sum :
 150

 Boolean Filter :
 A    False
B     True
C     True
D     True
E     True
dtype: bool

 return subseries  :
   B    20
C    30
D    40
E    50
dtype: int64

 Sort by value  :
 A    10
B    20
C    30
D    40
E    50
dtype: int64

 Sort by index :
 A    10
B    20
C    30
D    40
E    50
dtype: int64

 Replace values :
 A    10
B    20
C    30
D    40
E    50
dtype: int64

 Unique values  :
 ['New York' 'Boston' 'Chicago' 'London' 'Berlin']

 Set text to upper case :
 0    NEW YORK
1      BOSTON
2     CHICAGO
3      LONDON
4      BERLIN
5      BOSTON
dtype: object
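Two of the operations above are worth a closer look. The sketch below, which rebuilds the same small Series so that it runs on its own, shows that .replace() matches values rather than positions, and that .nunique() and .value_counts() report how many distinct values there are and how often each occurs.

```python
import pandas as pd

prices = pd.Series([10, 20, 30, 40, 50], index=['A', 'B', 'C', 'D', 'E'])
cities = pd.Series(['New York', 'Boston', 'Chicago', 'London', 'Berlin', 'Boston'])

# .replace() looks for the value 20 wherever it occurs, not for position 20
replaced = prices.replace(20, 710000)

n_unique = cities.nunique()       # number of distinct values
counts = cities.value_counts()    # occurrences of each value, most frequent first

print(replaced)
print(n_unique)
print(counts)
```

Note that .replace() returns a new Series and leaves prices itself unchanged.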


# The Data Frame!

A DataFrame is a two-dimensional labeled data structure in Pandas. Its tabular format is similar to a table or a spreadsheet with rows and columns. DataFrames are widely used for data manipulation, exploration, and analysis.

DataFrames are one of the most powerful and flexible tools available for data analysis and manipulation in Python, largely because of their rich functionality and ease of use. Let's create our first DataFrame using a dictionary.

In [34]:
import pandas as pd

# Create a DataFrame from a dictionary

# data dictionary
data = {
    'Address': ['123 Main St', '456 Elm St', '789 Oak St', '321 Pine St', '654 Maple St'],
    'Price': [500000, 600000, 450000, 700000, 550000],
    'Type': ['Single-Family', 'Condo', 'Townhouse', 'Single-Family', 'Apartment'],
    'Bedrooms': [3, 4, 2, 3, 3],
    'Bathrooms': [2, 2.5, 1.5, 3, 2]
}

# Create a DataFrame using pd.DataFrame
df = pd.DataFrame(data)
print(df)


        Address   Price           Type  Bedrooms  Bathrooms
0   123 Main St  500000  Single-Family         3        2.0
1    456 Elm St  600000          Condo         4        2.5
2    789 Oak St  450000      Townhouse         2        1.5
3   321 Pine St  700000  Single-Family         3        3.0
4  654 Maple St  550000      Apartment         3        2.0


Here are its components:

**Columns**: Each key in the dictionary represents a column in the resulting DataFrame. In this case, our DataFrame will have the following columns: 'Address', 'Price', 'Type', 'Bedrooms', and 'Bathrooms'.

**Rows**: Each list associated with a column key represents the values in that column. Each position in the list corresponds to a row in the DataFrame. So, the values at index 0 across all lists form the first row, the values at index 1 form the second row, and so on.

You can also create a DataFrame from Series, NumPy arrays, CSV files, database queries, API calls, and so on, but more on this later.
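As a brief preview, here is a minimal sketch of two of those alternatives: building a DataFrame from Series and from a NumPy array. The two-row sample data and the variable names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

addresses = pd.Series(['123 Main St', '456 Elm St'])
prices = pd.Series([500000, 600000])

# From Series: each dictionary key becomes a column name
df_from_series = pd.DataFrame({'Address': addresses, 'Price': prices})

# From a 2-D NumPy array: column names are supplied separately
rooms = np.array([[3, 2.0], [4, 2.5]])
df_from_array = pd.DataFrame(rooms, columns=['Bedrooms', 'Bathrooms'])

print(df_from_series)
print(df_from_array)
```

When the input Series share the same index, as they do here, their values line up row by row.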


## Slicing a DataFrame

DataFrames can be sliced to extract a Series (a column).

In [35]:
# Accessing columns in a DataFrame
address_series = df['Address']           # Access a single column
println(address_series)

price_series = df['Price']           # Access a single column
println(price_series)

0     123 Main St
1      456 Elm St
2      789 Oak St
3     321 Pine St
4    654 Maple St
Name: Address, dtype: object 

0    500000
1    600000
2    450000
3    700000
4    550000
Name: Price, dtype: int64 



## Extracting Multiple Columns

We can also specify multiple columns to create a new DataFrame. Note the double square brackets.

In [36]:
# Slice multiple columns
address_price_df = df[['Address', 'Price']]

print(address_price_df)


        Address   Price
0   123 Main St  500000
1    456 Elm St  600000
2    789 Oak St  450000
3   321 Pine St  700000
4  654 Maple St  550000


The .loc indexer is used to access and manipulate data in a DataFrame based on label-based indexing. It allows you to select specific rows and columns using index labels.

Recall the .iloc indexer: The .iloc indexer is used for position-based indexing in a DataFrame. It allows you to select specific rows and columns using integer-based positions.



In [37]:
# Accessing rows in a DataFrame

println(df.loc[1])   #  Accesses the DataFrame by index labels. In this case, it returns the row for which the index label is 1.

println(df.iloc[2])  # Accesses the DataFrame by integer position. It returns the row at position 2 (the third row, as indexing starts at 0).



Address      456 Elm St
Price            600000
Type              Condo
Bedrooms              4
Bathrooms           2.5
Name: 1, dtype: object 

Address      789 Oak St
Price            450000
Type          Townhouse
Bedrooms              2
Bathrooms           1.5
Name: 2, dtype: object 
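Both indexers also accept a column selector after the row selector. The sketch below uses a reduced three-row version of the lesson's data, so it can run on its own, and picks out single cells and a small block.

```python
import pandas as pd

df = pd.DataFrame({
    'Address': ['123 Main St', '456 Elm St', '789 Oak St'],
    'Price': [500000, 600000, 450000],
    'Bedrooms': [3, 4, 2],
})

price_by_label = df.loc[1, 'Price']          # row label 1, column 'Price'
price_by_position = df.iloc[1, 1]            # second row, second column
subset = df.loc[0:1, ['Address', 'Price']]   # label slice: rows 0 and 1, both included

print(price_by_label, price_by_position)
print(subset)
```

With the default integer index, the row label 1 and the row position 1 refer to the same row, which is why both lookups agree.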



By default, when you create a DataFrame without specifying an index, it automatically generates a default, zero-based, numerical index. This default index is not any of the columns from your data but an auto-generated sequence created by Pandas for ease of reference, starting from 0 and increasing by 1 for each row.


If we want to access data using text labels instead of numeric indices, we can set the 'Address' column as the index of the DataFrame. Here's an updated example:

In [38]:
# Set 'Address' as the index
df.set_index('Address', inplace=True)

print(df.loc['456 Elm St'])

Price        600000
Type          Condo
Bedrooms          4
Bathrooms       2.5
Name: 456 Elm St, dtype: object
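An alternative to inplace=True is to keep the result of set_index in a new variable, which leaves the original DataFrame untouched. This sketch assumes a two-row sample of the lesson's data and also shows reset_index, which moves the index back into a regular column.

```python
import pandas as pd

df = pd.DataFrame({
    'Address': ['123 Main St', '456 Elm St'],
    'Price': [500000, 600000],
})

indexed = df.set_index('Address')   # returns a new DataFrame; df is unchanged
restored = indexed.reset_index()    # 'Address' becomes a regular column again

print(indexed.loc['456 Elm St', 'Price'])
print(restored)
```

Keeping both versions around can be handy while exploring, since re-running a cell that sets the index in place fails once the 'Address' column is gone.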


## Filtering DataFrames

Filtering a DataFrame means selecting specific rows that meet certain conditions, similar to filtering a Series.

For example, to filter the DataFrame based on the number of bedrooms, we'll use boolean indexing. This process involves creating a Series of boolean values (True or False) that result from checking a condition against each row in the DataFrame. Rows that meet the condition (where the condition is True) are included in the result.

Let's say we want to keep only the houses that have 3 bedrooms. Here's how we can do it:

In [44]:
println(df['Price'] > 560000)

# Filter the DataFrame based on the condition (houses that have 3 bedrooms)
filtered_df = df[df['Bedrooms'] == 3]

println(filtered_df)


Address
123 Main St     False
456 Elm St       True
789 Oak St      False
321 Pine St      True
654 Maple St    False
Name: Price, dtype: bool 

               Price           Type  Bedrooms  Bathrooms
Address                                                 
123 Main St   500000  Single-Family         3        2.0
321 Pine St   700000  Single-Family         3        3.0
654 Maple St  550000      Apartment         3        2.0 



There is actually a lot happening in the line of code that filters the DataFrame. Let's break it down.

1. We create a DataFrame named df from the specified dictionary data.
2. We apply a filter condition df['Bedrooms'] == 3. This checks each row in df to see if the 'Bedrooms' column has a value equal to 3. It creates a boolean Series.

    filtered_df = df[df['Bedrooms'] == 3]

3. We use this boolean Series to index into df. The DataFrame treats this as a signal to filter rows, keeping rows with True (bedrooms equal to 3) and discarding rows with False.
4. The result is stored in filtered_df, which should now contain only the rows from the original DataFrame where the number of bedrooms is 3.
5. Finally, we print filtered_df to see the filtered data.
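The same steps extend to more than one condition. As a sketch, assuming the lesson's prices and bedroom counts, conditions can be combined with & (and) or | (or), with each condition wrapped in parentheses:

```python
import pandas as pd

df = pd.DataFrame({
    'Address': ['123 Main St', '456 Elm St', '789 Oak St', '321 Pine St', '654 Maple St'],
    'Price': [500000, 600000, 450000, 700000, 550000],
    'Bedrooms': [3, 4, 2, 3, 3],
})

# Each comparison produces a boolean Series; & combines them element by element
mask = (df['Bedrooms'] == 3) & (df['Price'] > 520000)
result = df[mask]

print(result)
```

The parentheses matter because & binds more tightly than == and > in Python.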

## Describe

We can use the df.describe() method to get summary statistics of the data, such as count, mean, standard deviation, minimum, and maximum values for numeric columns.

Accessing specific columns is straightforward, and, as shown earlier, we can filter the data based on conditions using boolean indexing. For instance, with a hypothetical column called 'Column_Name', the expression df['Column_Name'] > 100 selects only the rows where that column's value is greater than 100, and the result could be stored in a variable such as filtered_data.


In [45]:
# Describe

df.describe()

               Price  Bedrooms  Bathrooms
count       5.000000  5.000000   5.000000
mean   560000.000000  3.000000   2.200000
std     96176.920308  0.707107   0.570088
min    450000.000000  2.000000   1.500000
25%    500000.000000  3.000000   2.000000
50%    550000.000000  3.000000   2.000000
75%    600000.000000  3.000000   2.500000
max    700000.000000  4.000000   3.000000


The describe method, by default, provides the following statistical details:

* count: Number of non-null entries.
* mean: Mean of column values.
* std: Standard deviation.
* min: Minimum value.
* 25%: 25th percentile.  (aka first quartile) value below which 25% of the data in a dataset falls
* 50%: Median (or 50th percentile). (aka second quartile)
* 75%: 75th percentile. (aka third quartile)
* max: Maximum value.
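The percentile rows can be reproduced directly, which may help in reading them. This sketch assumes the lesson's five prices and uses the Series methods quantile and median.

```python
import pandas as pd

prices = pd.Series([500000, 600000, 450000, 700000, 550000])

q1 = prices.quantile(0.25)    # first quartile, the 25% row
median = prices.median()      # second quartile, the 50% row
q3 = prices.quantile(0.75)    # third quartile, the 75% row

print(q1, median, q3)
```

These values match the 25%, 50%, and 75% rows that describe reports for the Price column.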

To include summaries of non-numeric data, use the include parameter.



In [46]:
summary_all = df.describe(include='all')
println(summary_all)

println(df)

                Price           Type  Bedrooms  Bathrooms
count        5.000000              5  5.000000   5.000000
unique            NaN              4       NaN        NaN
top               NaN  Single-Family       NaN        NaN
freq              NaN              2       NaN        NaN
mean    560000.000000            NaN  3.000000   2.200000
std      96176.920308            NaN  0.707107   0.570088
min     450000.000000            NaN  2.000000   1.500000
25%     500000.000000            NaN  3.000000   2.000000
50%     550000.000000            NaN  3.000000   2.000000
75%     600000.000000            NaN  3.000000   2.500000
max     700000.000000            NaN  4.000000   3.000000 

               Price           Type  Bedrooms  Bathrooms
Address                                                 
123 Main St   500000  Single-Family         3        2.0
456 Elm St    600000          Condo         4        2.5
789 Oak St    450000      Townhouse         2        1.5
321 Pine St   700000  Single-Family         3        3.0
654 Maple St  550000      Apartment         3        2.0 

For object data (like strings), describe will provide:

* count: Number of non-null entries.
* unique: Number of distinct values.
* top: Most frequent value.
* freq: Frequency of the top value.

You can also exclude certain data types.

    summary_exclude_str = df.describe(exclude=[object])
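As a related sketch, the include and exclude parameters can also take NumPy's np.number to split a summary into its numeric and non-numeric parts. The three-row DataFrame here is an assumed sample, not the lesson's full data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Price': [500000, 600000, 450000],
    'Type': ['Single-Family', 'Condo', 'Single-Family'],
})

numeric_summary = df.describe(include=[np.number])   # numeric columns only
text_summary = df.describe(exclude=[np.number])      # everything that is not numeric

print(numeric_summary)
print(text_summary)
```

Splitting the summary this way avoids the NaN-filled cells that appear when include='all' mixes both kinds of columns.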

Understanding the statistical summary returned by describe is fundamental in initial data exploration, enabling you to discern distributions, tendencies, and potential outliers in your data. More on this later!

# Exercises

## Exercise 1: Series Creation and Operations

Create a Pandas Series that represents the number of bedrooms in different houses. Use custom string indexes representing the house names (e.g., "House A", "House B", etc.). Perform the following operations:

1. Create a Series with custom indexes.
2. Access the number of bedrooms in "House B".
3. Calculate the average number of bedrooms in all houses.
4. Replace the number of bedrooms in "House C" with a new value.

## Exercise 2: DataFrame Manipulations

Given a DataFrame containing information about different houses (similar to the one in the lesson), perform the following tasks:

1. Filter the DataFrame to only include houses with more than 2 bathrooms.
2. Create a new DataFrame that contains only the 'Address' and 'Price' columns.
3. Use the .describe() method to get a summary of the 'Price' column.

## Exercise 3: Data Analysis and Manipulation

Using the original DataFrame from the lesson (with 'Address' as the index), perform the following:

1. Access the data of the house located at '789 Oak St'.
2. Update the price of the house at '123 Main St' to a new value.
3. Find the mean price of houses with 3 bedrooms.

