# Pandas Tutorial 2: DataFrame Basics

This is arguably the most important tutorial in this entire series. It covers the fundamentals of working with **DataFrames** in Pandas, the core structure used to handle tabular data.

**Topics covered:**
- Creating a DataFrame
- Manipulating Rows and Columns
- Performing Operations: `min`, `max`, `std`, `describe`
- Conditional Selection
- Using `set_index`

Let's explore how to create and manipulate DataFrames to handle more complex data operations.

In [1]:
import pandas as pd

### Reading Data from a CSV File Using `read_csv()`

The `read_csv()` function in Pandas is used to load data from a CSV (Comma Separated Values) file into a DataFrame. In this case, the CSV file named `"weather_data.csv"` is read and stored in the DataFrame `df_1`. This is one of the most common methods to import structured data into Pandas for analysis.

**Key features:**
- `read_csv()`: Automatically parses the CSV data and loads it into a DataFrame.
- CSV files are commonly used for data storage due to their simplicity and wide support.

In [2]:
# Reading data from a CSV file
df_1 = pd.read_csv("weather_data.csv")
df_1

,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain
5,1/6/2017,31,2,Sunny
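
To try `read_csv()` without the file on disk, the sketch below feeds it an in-memory string instead of a path. The `csv_text` contents are a hypothetical, shortened stand-in for `weather_data.csv`, and `df_csv` is an illustrative name that does not appear elsewhere in this tutorial.

```python
import io

import pandas as pd

# Hypothetical stand-in for the contents of "weather_data.csv"
csv_text = """day,temperature,windspeed,event
1/1/2017,32,6,Rain
1/2/2017,35,7,Sunny
1/3/2017,28,2,Snow
"""

# read_csv accepts a file path or any file-like object
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_csv.shape)          # (3, 4)
print(list(df_csv.columns))  # ['day', 'temperature', 'windspeed', 'event']
print(df_csv.dtypes)         # numeric columns are inferred automatically
```

The first line of the text is treated as the header row, and each remaining line becomes one row of the DataFrame.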


### Creating a DataFrame from a Dictionary

In this example, a dictionary named `weather_data` is used to create a Pandas DataFrame. Each key in the dictionary represents a column in the DataFrame, and the associated values (lists) represent the data for each column.

**Key features:**
- `pd.DataFrame(dictionary)`: Converts the dictionary into a DataFrame.
- This method is useful for quickly creating DataFrames from structured data stored in dictionary format.

In [3]:
# Creating a dictionary with weather data
weather_data = {
    'day': ['1/1/2017','1/2/2017','1/3/2017','1/4/2017','1/5/2017','1/6/2017'],
    'temperature': [32,35,28,24,32,31],
    'windspeed': [6,7,2,7,4,2],
    'event': ['Rain', 'Sunny', 'Snow', 'Snow', 'Rain', 'Sunny']
}
df = pd.DataFrame(weather_data)
df

,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain
5,1/6/2017,31,2,Sunny
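
As a small sketch of the dictionary rule described above, the example below builds a two-column DataFrame and then shows what happens when the lists have different lengths. The names `weather` and `df_small` are illustrative only.

```python
import pandas as pd

# Each key becomes a column label; each list becomes that column's values
weather = {
    "day": ["1/1/2017", "1/2/2017", "1/3/2017"],
    "temperature": [32, 35, 28],
}
df_small = pd.DataFrame(weather)

print(list(df_small.columns))  # ['day', 'temperature']
print(df_small.shape)          # (3, 2)

# Lists of unequal length cannot be aligned into rows, so pandas raises an error
try:
    pd.DataFrame({"day": ["1/1/2017", "1/2/2017"], "temperature": [32]})
except ValueError as err:
    print("ValueError:", err)
```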


In [4]:
# Returns the DataFrame's dimensions as a (rows, columns) tuple
df.shape

(6, 4)

In [5]:
# Unpacks the tuple into the variables `rows` and `columns`
rows, columns = df.shape

In [6]:
columns

4

In [7]:
df.head()

,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain


### Selecting a Range of Rows Using Slicing

Slicing a DataFrame with `df[start:end]` selects rows by position: it returns rows starting at position `start` and stopping before position `end` (the `end` position is not included). Here, `df[2:5]` returns the rows at positions 2, 3, and 4, which also carry the labels 2, 3, and 4 because `df` uses the default integer index.

**Key features:**
- Slicing a DataFrame selects a contiguous range of rows.
- The start position is included, but the end position is **excluded**.

In [8]:
# Selects the rows at positions 2, 3, and 4
df[2:5]

,day,temperature,windspeed,event
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain
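
The sketch below illustrates that an integer slice counts positions, not labels. It uses a hypothetical DataFrame `df_s` with letter labels so the two cannot be confused, and compares the slice with the explicit position-based indexer `iloc`.

```python
import pandas as pd

df_s = pd.DataFrame(
    {"temperature": [32, 35, 28, 24, 32, 31]},
    index=["a", "b", "c", "d", "e", "f"],  # hypothetical non-integer labels
)

by_slice = df_s[2:5]      # rows at positions 2, 3, 4
by_iloc = df_s.iloc[2:5]  # explicit position-based equivalent

print(list(by_slice.index))      # ['c', 'd', 'e']
print(by_slice.equals(by_iloc))  # True
```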


In [9]:
# First 2 rows of df
df.head(2)

,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny


In [10]:
# Last 2 rows of df
df.tail(2)

,day,temperature,windspeed,event
4,1/5/2017,32,4,Rain
5,1/6/2017,31,2,Sunny


### Accessing Column Labels with `columns`

The `columns` attribute of a Pandas DataFrame returns the column labels as an `Index` object. This is useful for quickly inspecting the names of the columns in your DataFrame.

**Key features:**
- `df.columns`: Provides the column names as an index-like object.
- Helpful for verifying or programmatically accessing the column names.

In [11]:
# Column labels of df
df.columns

Index(['day', 'temperature', 'windspeed', 'event'], dtype='object')

### Accessing a DataFrame Column

You can access a specific column of a DataFrame using two syntaxes:
- `df['column_name']`: The preferred and more flexible syntax. It works for any column name, including names that contain spaces or special characters.
- `df.column_name`: A shorthand that works only when the column name is a valid Python identifier (no spaces or special characters) and does not clash with an existing DataFrame attribute or method, such as `count` or `shape`.

**Key features:**
- Both methods return the column as a `Series` object.
- The first syntax is more robust, especially for complex column names.

In [13]:
df['event'] # Accesses and returns the 'event' column of the DataFrame df as a Series
# df.event  # Alternative syntax

0     Rain
1    Sunny
2     Snow
3     Snow
4     Rain
5    Sunny
Name: event, dtype: object
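
To make the difference between the two syntaxes concrete, here is a minimal sketch. The DataFrame `df_c` and its `wind speed` and `count` columns are hypothetical, chosen to show the cases where attribute access does not reach the column.

```python
import pandas as pd

df_c = pd.DataFrame({
    "event": ["Rain", "Sunny"],
    "wind speed": [6, 7],  # name containing a space
    "count": [1, 2],       # name that matches a DataFrame method
})

# For a simple name, both syntaxes return the same Series
print(df_c["event"].equals(df_c.event))  # True

# Bracket syntax works for every column name
print(df_c["wind speed"].tolist())  # [6, 7]
print(df_c["count"].tolist())       # [1, 2]

# Attribute syntax finds the DataFrame.count method, not the column
print(callable(df_c.count))  # True
```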

### Selecting Multiple Columns from a DataFrame

You can select multiple columns from a DataFrame by passing a list of column names. This returns a new DataFrame containing only the specified columns.

**Key features:**
- `df[['col1', 'col2', 'col3']]`: Creates a new DataFrame with the specified columns.
- Useful for narrowing down the data to specific columns you need for analysis.

In [14]:
# Return a new DataFrame containing only `event`, `day`, and `temperature` columns from df
df[['event','day','temperature']]

,event,day,temperature
0,Rain,1/1/2017,32
1,Sunny,1/2/2017,35
2,Snow,1/3/2017,28
3,Snow,1/4/2017,24
4,Rain,1/5/2017,32
5,Sunny,1/6/2017,31
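
The brackets matter here: a single label returns a `Series`, while a list of labels returns a `DataFrame`, even when the list holds only one name. A short sketch, using an illustrative DataFrame `df_m`:

```python
import pandas as pd

df_m = pd.DataFrame({
    "day": ["1/1/2017", "1/2/2017"],
    "temperature": [32, 35],
    "event": ["Rain", "Sunny"],
})

as_series = df_m["event"]          # single label: Series
as_frame = df_m[["event"]]         # list of labels: DataFrame
reordered = df_m[["event", "day"]] # columns come back in the order listed

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(list(reordered.columns))   # ['event', 'day']
```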


### Finding the Max/Min Value in a Column

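The `max()` and `min()` methods return the largest and smallest values in a column. Here they return the highest and lowest temperatures in the `temperature` column. The same pattern applies to `std()`, which returns the standard deviation, and `describe()` summarizes every numeric column at once.

**Key features:**
- `df['column'].max()`: Returns the maximum value in the specified column.
- `df['column'].min()`: Returns the minimum value in the specified column.
- Useful for quickly identifying the extremes of a numeric column.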

In [15]:
# Maximum value from `temperature` column of df
df['temperature'].max()

35

In [16]:
# Minimum value from `temperature` column of df
df['temperature'].min()

24

In [17]:
# Std Deviation of values in `temperature` column of df
df['temperature'].std()

3.8297084310253524

In [18]:
# Generates descriptive statistics for the numerical columns in df
df.describe()

,temperature,windspeed
count,6.0,6.0
mean,30.333333,4.666667
std,3.829708,2.33809
min,24.0,2.0
25%,28.75,2.5
50%,31.5,5.0
75%,32.0,6.75
max,35.0,7.0
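
The `std` value shown above is the sample standard deviation: by default, pandas divides the sum of squared deviations by `n - 1` rather than `n`. The sketch below recomputes it by hand for the same six temperatures; `temps` and the other variable names are illustrative.

```python
import math

import pandas as pd

temps = pd.Series([32, 35, 28, 24, 32, 31], name="temperature")

mean = temps.mean()
squared_dev = ((temps - mean) ** 2).sum()

sample_std = math.sqrt(squared_dev / (len(temps) - 1))  # divides by n - 1
population_std = math.sqrt(squared_dev / len(temps))    # divides by n

# The default std() matches the sample formula
print(round(sample_std, 6), round(temps.std(), 6))  # 3.829708 3.829708

# Passing ddof=0 switches to the population formula
print(math.isclose(population_std, temps.std(ddof=0)))  # True
```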


### Conditional Selection in a DataFrame

You can filter a DataFrame with the syntax `df[condition]`, where `condition` is a boolean Series with one `True`/`False` value per row. Here, it returns a new DataFrame containing only the rows where `temperature` is greater than or equal to 32.

**Key features:**
- `df[condition]`: Keeps the rows for which the condition is `True`.
- Returns a new DataFrame; the original row labels (0, 1, and 4 here) are preserved.

In [19]:
# Conditional Selection: `temperature` has values >= 32 (Returns new df) 
df[df.temperature>=32]

,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny
4,1/5/2017,32,4,Rain
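
It can help to look at the condition on its own before using it as a filter. The sketch below, using an illustrative copy of the weather data named `df_f`, prints the boolean mask and then combines two conditions with `&`. Each condition needs its own parentheses because `&` binds more tightly than comparison operators.

```python
import pandas as pd

df_f = pd.DataFrame({
    "day": ["1/1/2017", "1/2/2017", "1/3/2017", "1/4/2017", "1/5/2017", "1/6/2017"],
    "temperature": [32, 35, 28, 24, 32, 31],
    "event": ["Rain", "Sunny", "Snow", "Snow", "Rain", "Sunny"],
})

# The condition is a boolean Series, one value per row
mask = df_f["temperature"] >= 32
print(mask.tolist())  # [True, True, False, False, True, False]

# Indexing with the mask keeps only the True rows
warm = df_f[mask]
print(list(warm.index))  # [0, 1, 4]

# Combine conditions with & (and) or | (or)
warm_rain = df_f[(df_f["temperature"] >= 32) & (df_f["event"] == "Rain")]
print(list(warm_rain.index))  # [0, 4]
```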


### Conditional Selection Based on Maximum Value

This operation filters the DataFrame to return only the rows where the `temperature` column has the maximum value. It compares each value in the `temperature` column to the result of `df['temperature'].max()`.

**Key features:**
- `df[df.column == df['column'].max()]`: Filters rows based on the maximum value in the specified column.
- Useful for retrieving rows with the highest values in a column.

In [20]:
# Conditional Selection: `temperature` has maximum value (Returns new df) 
df[df.temperature==df['temperature'].max()]

,day,temperature,windspeed,event
1,1/2/2017,35,7,Sunny
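
Comparing against `max()` returns every row that ties for the maximum. The `windspeed` column of this dataset has such a tie, so the sketch below uses it to contrast the comparison with `idxmax()`, which returns only the label of the first maximum. The name `df_t` is illustrative.

```python
import pandas as pd

df_t = pd.DataFrame({
    "day": ["1/1/2017", "1/2/2017", "1/3/2017", "1/4/2017", "1/5/2017", "1/6/2017"],
    "windspeed": [6, 7, 2, 7, 4, 2],
})

# The comparison keeps all rows that share the maximum value
top = df_t[df_t["windspeed"] == df_t["windspeed"].max()]
print(list(top.index))  # [1, 3]

# idxmax() returns the label of the first occurrence only
first_label = df_t["windspeed"].idxmax()
print(first_label)  # 1
```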


### Conditional Selection with Specific Columns

This operation returns only the `day` and `temperature` columns for the rows where `temperature` has its maximum value. As written, the expression first selects the two columns and then applies the row filter to that result.

**Key features:**
- `df[['col1', 'col2']][condition]`: Selects specific columns, then keeps the rows that meet the condition.
- Useful for narrowing the result to the columns you need alongside a condition.

In [21]:
# Conditional Selection: `temperature` has maximum value (Returns new df)
# Filter only the `day` and `temperature` for the above
df[['day','temperature']][df.temperature==df['temperature'].max()]

,day,temperature
1,1/2/2017,35
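
The same result can be obtained in a single indexing step with `loc`, which accepts a row condition and a list of columns together. This is offered as an alternative sketch rather than a replacement for the cell above; `df_l` and `is_max` are illustrative names.

```python
import pandas as pd

df_l = pd.DataFrame({
    "day": ["1/1/2017", "1/2/2017", "1/3/2017", "1/4/2017", "1/5/2017", "1/6/2017"],
    "temperature": [32, 35, 28, 24, 32, 31],
    "windspeed": [6, 7, 2, 7, 4, 2],
})

is_max = df_l["temperature"] == df_l["temperature"].max()

# Two chained steps: pick columns, then filter rows
chained = df_l[["day", "temperature"]][is_max]

# One step: rows and columns in the same loc call
single = df_l.loc[is_max, ["day", "temperature"]]

print(single.equals(chained))  # True
print(single["day"].tolist())  # ['1/2/2017']
```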


In [22]:
# Returns index (row labels) of df
df.index

RangeIndex(start=0, stop=6, step=1)

In [23]:
df

,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain
5,1/6/2017,31,2,Sunny


### Setting a Column as the Index Using `set_index()`

The `set_index()` method sets a specified column (in this case, `day`) as the index of the DataFrame. The `inplace=True` argument modifies the original DataFrame rather than returning a new one.

**Key features:**
- `set_index('column_name')`: Sets the specified column as the index of the DataFrame.
- `inplace=True`: Modifies the original DataFrame instead of returning a new one.

In [24]:
# Sets the `day` column as the index of the DataFrame and modifies the original DataFrame in place
df.set_index('day', inplace=True)

Now the dates (e.g., **1/1/2017**, **1/2/2017**) are used as the index of the DataFrame `df`.

**Note:** By default, the `set_index()` method returns a new DataFrame; passing `inplace=True` modifies the original DataFrame directly.

In [25]:
df

day,temperature,windspeed,event
1/1/2017,32,6,Rain
1/2/2017,35,7,Sunny
1/3/2017,28,2,Snow
1/4/2017,24,7,Snow
1/5/2017,32,4,Rain
1/6/2017,31,2,Sunny
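
The two ways of calling `set_index()` can be compared side by side. In this sketch, `df_i` and `indexed` are illustrative names: the first call leaves `df_i` untouched and returns a new DataFrame, while the second changes `df_i` itself.

```python
import pandas as pd

df_i = pd.DataFrame({
    "day": ["1/1/2017", "1/2/2017", "1/3/2017"],
    "temperature": [32, 35, 28],
})

# Default behaviour: a new DataFrame is returned, df_i is unchanged
indexed = df_i.set_index("day")
cols_before = list(df_i.columns)
print(cols_before)         # ['day', 'temperature']
print(indexed.index.name)  # day

# With inplace=True, df_i itself is modified
df_i.set_index("day", inplace=True)
print(list(df_i.columns))  # ['temperature']
print(df_i.index.name)     # day
```

Assigning the returned DataFrame to a name, as in the first call, keeps the original available for comparison.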


### Selecting a Row by Index Using `loc[]`

The `loc[]` indexer accesses rows by index label. Here, it returns the row whose index is `'1/4/2017'`. Because the `day` column has been set as the index, `loc[]` retrieves the data for that specific day as a `Series`.

**Key features:**
- `df.loc['index_label']`: Selects the row (or rows, if the label is repeated) that matches the specified index label.
- This is particularly useful when the index consists of meaningful labels (e.g., dates).

In [26]:
# The row of df where the index is '1/4/2017'
df.loc['1/4/2017']

temperature      24
windspeed         7
event          Snow
Name: 1/4/2017, dtype: object
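
A few related lookups are sketched below on an illustrative DataFrame `df_k` indexed by `day`: reading one value out of the returned row, asking `loc` for a single cell directly, and handling a label that is not in the index. The date `'1/9/2017'` is a hypothetical missing label.

```python
import pandas as pd

df_k = pd.DataFrame({
    "day": ["1/3/2017", "1/4/2017", "1/5/2017"],
    "temperature": [28, 24, 32],
    "event": ["Snow", "Snow", "Rain"],
}).set_index("day")

# A unique label returns one row as a Series
row = df_k.loc["1/4/2017"]
print(row["temperature"])  # 24

# Row label and column label together return a single value
print(df_k.loc["1/4/2017", "temperature"])  # 24

# A label that is not in the index raises KeyError
try:
    df_k.loc["1/9/2017"]
except KeyError:
    print("label not found")
```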

### Resetting the Index Using `reset_index()`

The `reset_index()` method resets the DataFrame index to the default integer index (0, 1, 2, ...). The original index (in this case, the `day` column) is converted back to a regular column. The `inplace=True` argument ensures that the DataFrame is modified in place.

**Key features:**
- `reset_index()`: Resets the index to the default integer index and converts the current index back to a column.
- `inplace=True`: Modifies the original DataFrame instead of returning a new one.

In [27]:
# Restores the default integer index
df.reset_index(inplace=True)
df

,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain
5,1/6/2017,31,2,Sunny
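
As a quick check that `reset_index()` undoes `set_index()`, the sketch below round-trips an illustrative DataFrame `df_r`. It also shows the `drop=True` option, which discards the old index instead of turning it back into a column.

```python
import pandas as pd

df_r = pd.DataFrame({
    "day": ["1/1/2017", "1/2/2017", "1/3/2017"],
    "temperature": [32, 35, 28],
    "event": ["Rain", "Sunny", "Snow"],
})

indexed = df_r.set_index("day")

# reset_index() moves `day` back into a column and restores 0, 1, 2, ...
restored = indexed.reset_index()
print(restored.equals(df_r))  # True

# drop=True discards the old index rather than keeping it as a column
dropped = indexed.reset_index(drop=True)
print(list(dropped.columns))  # ['temperature', 'event']
```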


In [28]:
# Set event (Rain, Sunny, Snow) as the index
df.set_index('event',inplace=True)
df

event,day,temperature,windspeed
Rain,1/1/2017,32,6
Sunny,1/2/2017,35,7
Snow,1/3/2017,28,2
Snow,1/4/2017,24,7
Rain,1/5/2017,32,4
Sunny,1/6/2017,31,2


In [29]:
# The rows of df where the index is 'Snow'
df.loc['Snow']

event,day,temperature,windspeed
Snow,1/3/2017,28,2
Snow,1/4/2017,24,7
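
Because `event` contains repeated values, the index is no longer unique, and the type returned by `loc` depends on how many rows match. The sketch below, on an illustrative DataFrame `df_e`, shows both cases and one way to get a DataFrame back regardless of the match count.

```python
import pandas as pd

df_e = pd.DataFrame({
    "event": ["Rain", "Sunny", "Snow", "Snow"],
    "temperature": [32, 35, 28, 24],
}).set_index("event")

print(df_e.index.is_unique)  # False

# A repeated label returns all matching rows as a DataFrame
snow = df_e.loc["Snow"]
print(type(snow).__name__, len(snow))  # DataFrame 2

# A label that appears once returns a single row as a Series
sunny = df_e.loc["Sunny"]
print(type(sunny).__name__)  # Series

# Passing a list of labels always returns a DataFrame
sunny_frame = df_e.loc[["Sunny"]]
print(type(sunny_frame).__name__)  # DataFrame
```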
