# Pandas

Pandas is a popular data analysis library for Python that makes it easy to work with large datasets. It provides two main structures: the DataFrame, which is a table of data (similar to an Excel spreadsheet or SQL table), and the Series, which is a single column or row of data. With pandas, you can efficiently clean, filter, sort, and analyze data, which is essential for many data science and machine learning tasks.


**Example 1**: Imagine you have data for various rooms in a building. In the table below, the first column is "Room," with names like "Office One" and "Lab 1," and each row represents one room. We have certain information for every room, like annual energy consumption (numbers), sensor availability (True or False values), and admin names (text).

In [40]:
import pandas as pd

# Create a DataFrame
data = {
    'Room': ['Office One', 'Office Two', 'Lab 1', 'Kitchen'],
    'Annual Energy Consumption (kWh)': [1500, 1200, 2300, 500],
    'Temperature Sensor': [True, False, True, False],
    'CO2 Sensor': [True, True, False, True],
    'Humidity Sensor': [False, True, True, False],
    'Admin': ['Tim', 'Tom', 'Sarah', 'Tom']
}

df = pd.DataFrame(data)


### Exercise: Exploring Your First DataFrame
Use the DataFrame `df` created from the data above.

Try out the following **methods and attributes** to get a quick overview of your data (`df.shape` is an attribute, so it takes no parentheses):

`df.describe()`
   - What does it show?

`df.shape`
   - What do the two numbers represent?

`df.info()`
   - What types of data are in each column?


Test each method and write 1–2 sentences in a comment summarizing what it tells you about the dataset.

### Exercise: Loading Data from a CSV File

In many cases, we are not creating our DataFrame manually in Python. Instead, we work with external data sources.

A common format for tabular data is the **CSV (Comma-Separated Values)** file.

Let's load the following file into a pandas DataFrame using the function `pd.read_csv()`. Store the result in a variable called `df`:

**File:** `students.csv`

If you encounter issues, use an LLM (like ChatGPT) for support.
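
As a minimal sketch of how `pd.read_csv()` behaves, the snippet below reads CSV text from an in-memory buffer instead of a file, so it runs without `students.csv`. The column names and values are made up for illustration and are not the real contents of that file.

```python
import io

import pandas as pd

# Hypothetical CSV content; the real students.csv may look different.
csv_text = """Name,Height,Age
Anna,170,23
Ben,182,27
"""

# read_csv accepts a file path or any file-like object.
example_df = pd.read_csv(io.StringIO(csv_text))

print(example_df.shape)   # (number of rows, number of columns)
print(example_df.dtypes)  # data type inferred for each column
```

With a real file, pass its path instead, for example `pd.read_csv('students.csv')`, assuming the file sits in the notebook's working directory.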


Similar to the previous exercise, use the methods discussed (`df.describe()`, `df.shape`, `df.info()`) to get a quick overview of your data.

## Indexing Rows

Every DataFrame has an index, which is shown as the leftmost column when the DataFrame is displayed. We can access rows by position, just as we did with indexes in lists. For DataFrames this is done with the `iloc` indexer (short for "integer location", i.e. the integer row number).

In [31]:
df.iloc[0]

Name              Alice
Height              165
Age                  25
YearsStudying         2
Sex              Female
Name: 0, dtype: object

Slicing operators also work:

In [21]:
df.iloc[3:5]

Unnamed: 0.1  Unnamed: 0    Name  Height  Age  YearsStudying     Sex
           3           3   Diana     160   22              1  Female
           4           4  Edward     180   28              4    Male
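
The following sketch illustrates how `iloc` positions and slices behave, using a small made-up DataFrame (the names and ages are hypothetical). As with list slicing, the end position of a slice is excluded.

```python
import pandas as pd

# Hypothetical data, only for illustration.
people = pd.DataFrame({
    'Name': ['Ana', 'Bo', 'Cy', 'Di', 'Ed'],
    'Age': [21, 34, 19, 22, 28],
})

first_row = people.iloc[0]   # a single row, returned as a Series
middle = people.iloc[1:3]    # rows at positions 1 and 2 (position 3 is excluded)
last_row = people.iloc[-1]   # negative positions count from the end

print(first_row['Name'])
print(middle)
print(last_row['Name'])
```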


## Indexing columns

We can access a single column by its name, much like looking up a key in a dictionary. The result is a Series.

In [None]:
df['Height']

We can pass a list of column names to select several columns in one operation:

In [None]:
df[['Height', 'Age']]
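
The difference between single and double brackets can be checked directly. This is a small sketch on made-up data: a single column name returns a Series, while a list of names returns a DataFrame, even if the list holds only one name.

```python
import pandas as pd

# Hypothetical data, only for illustration.
people = pd.DataFrame({
    'Name': ['Ana', 'Bo'],
    'Height': [165, 180],
    'Age': [25, 28],
})

single = people['Height']            # one name -> Series
as_frame = people[['Height']]        # list with one name -> DataFrame
subset = people[['Height', 'Age']]   # list with two names -> DataFrame

print(type(single).__name__)
print(type(as_frame).__name__)
print(list(subset.columns))
```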

To extract a specific value, combine a row position with a column name. Both orders work:

In [35]:
df.iloc[3]['Height']

160

In [36]:
df['Height'].iloc[3]

160
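
Both forms above first select a row or a column and then index into the result. As a possible alternative, a single `loc`, `iloc`, or `at` call can take the row and the column together. The sketch below uses made-up data and assumes the default index, where row labels and row positions coincide.

```python
import pandas as pd

# Hypothetical data, only for illustration.
people = pd.DataFrame({
    'Name': ['Ana', 'Bo', 'Cy', 'Di'],
    'Height': [165, 180, 172, 160],
})

# loc: row label and column name
by_label = people.loc[3, 'Height']

# iloc: row position and column position
height_pos = people.columns.get_loc('Height')
by_position = people.iloc[3, height_pos]

# at: fast access to a single value by row label and column name
scalar = people.at[3, 'Height']

print(by_label, by_position, scalar)
```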

### Exercise: Indexing
- What is the height of the student in the 5th row?
- What is the name of the student in the 2nd row?

## Mathematical operations on columns

Columns can be used to perform mathematical operations in bulk, much like NumPy arrays (pandas uses NumPy under the hood for this).
We therefore do not need loops and can apply mathematical operations directly to a whole column of data.

For example, how old were they when they started studying?

In [None]:

df['Age'] - df['YearsStudying']
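
To make the contrast with loops concrete, here is a small sketch on made-up numbers that computes the same result twice: once with an explicit Python loop and once with a single column operation.

```python
import pandas as pd

# Hypothetical data, only for illustration.
people = pd.DataFrame({
    'Age': [25, 30, 22],
    'YearsStudying': [2, 5, 1],
})

# Loop version: one subtraction per row, written out by hand.
start_loop = []
for age, years in zip(people['Age'], people['YearsStudying']):
    start_loop.append(age - years)

# Column version: one expression, applied element by element.
start_vec = people['Age'] - people['YearsStudying']

print(start_loop)
print(start_vec.tolist())
```

Both lists hold the same values; the column version is shorter and usually much faster on large data.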

## Adding new columns

Let's add the age at which the students started studying to the DataFrame as a new column:

In [None]:
df['StartAge'] = df['Age'] - df['YearsStudying']
df
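
As a sketch of two ways to add a derived column, the snippet below uses two rooms from Example 1. The new column names (`Monthly Energy (kWh)` and `daily_kwh`) are made up for illustration. Bracket assignment changes the DataFrame in place, while `assign` returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

rooms = pd.DataFrame({
    'Room': ['Office One', 'Lab 1'],
    'Annual Energy Consumption (kWh)': [1500, 2300],
})

# Bracket assignment adds the column to rooms itself.
rooms['Monthly Energy (kWh)'] = rooms['Annual Energy Consumption (kWh)'] / 12

# assign returns a copy with the extra column; rooms is not modified.
rooms_with_daily = rooms.assign(
    daily_kwh=rooms['Annual Energy Consumption (kWh)'] / 365
)

print(list(rooms.columns))
print(list(rooms_with_daily.columns))
```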


**Exercise**: Calculate what percentage of their life each student has spent studying: (`YearsStudying` / `Age`) * 100

## Separating data based on conditions and grouping

While slicing by index is useful, in practice we often want to look at only a specific part of the data for an analysis, for example to compare the height of male and female students.

A comparison on a column produces a vector of True/False values, one per row, indicating whether the condition holds:

In [None]:
df['Sex'] == 'Male'

While this alone isn't that useful, such a vector of booleans can be used as an indexer. In the following example, we use it to select only the males from the DataFrame.

In [None]:
df[df['Sex'] == 'Male']
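
The two steps can also be written separately, which makes it easier to see what the mask contains. This is a minimal sketch on made-up data.

```python
import pandas as pd

# Hypothetical data, only for illustration.
people = pd.DataFrame({
    'Name': ['Ana', 'Bo', 'Cy'],
    'Sex': ['Female', 'Male', 'Male'],
})

mask = people['Sex'] == 'Male'   # Series of True/False, one value per row
males = people[mask]             # keeps only the rows where mask is True

print(mask.tolist())
print(mask.sum())                # True counts as 1, so this counts matching rows
print(males['Name'].tolist())
```

Note that the selected rows keep their original index labels.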

**Exercise**: Select all rows where the height is above 175

We can also combine such conditions. The syntax differs slightly from regular Python boolean operations: use `&` instead of `and` (and `|` instead of `or`), and wrap each condition in parentheses.

In [None]:
df[(df['Sex'] == 'Male') & (df['Age'] >= 35)]
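
The sketch below, on made-up data, shows the element-wise operators `&` (and), `|` (or), and `~` (not), and what happens when Python's `and` is used instead.

```python
import pandas as pd

# Hypothetical data, only for illustration.
people = pd.DataFrame({
    'Name': ['Ana', 'Bo', 'Cy', 'Di'],
    'Sex': ['Female', 'Male', 'Male', 'Female'],
    'Age': [25, 40, 22, 36],
})

is_male = people['Sex'] == 'Male'
is_35_plus = people['Age'] >= 35

both = people[is_male & is_35_plus]     # both conditions hold
either = people[is_male | is_35_plus]   # at least one condition holds
not_male = people[~is_male]             # condition negated

# Python's `and` needs one single True/False value, which a whole
# Series cannot provide, so pandas raises a ValueError.
try:
    people[is_male and is_35_plus]
except ValueError as err:
    print('ValueError:', err)

print(both['Name'].tolist())
print(either['Name'].tolist())
print(not_male['Name'].tolist())
```

Storing each condition in a named variable avoids the need for extra parentheses and keeps longer filters readable.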


**Exercise**: Select all rows of males with a height greater than 175


## Basic math functions on columns (or rows)

What we have already seen with `describe` can also be computed selectively for individual columns:

In [None]:
print(df['Age'].mean())
print(df['Age'].std())

Many other aggregation functions are available in pandas as well:

In [None]:
print(df['Age'].sum())
print(df['Age'].min())
print(df['Age'].max())


Some functions return not a scalar but another Series (a column):

In [None]:
print(df['Age'].cumsum())
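
Aggregations such as `sum()` and `mean()` can also be computed per group. One way to do this, not shown in the text above, is the `groupby` method. The sketch below applies it to the room data from Example 1 and sums the energy consumption per admin.

```python
import pandas as pd

rooms = pd.DataFrame({
    'Room': ['Office One', 'Office Two', 'Lab 1', 'Kitchen'],
    'Annual Energy Consumption (kWh)': [1500, 1200, 2300, 500],
    'Admin': ['Tim', 'Tom', 'Sarah', 'Tom'],
})

energy = rooms.groupby('Admin')['Annual Energy Consumption (kWh)']

# One value per admin: rows with the same Admin are aggregated together.
energy_by_admin = energy.sum()

# Several aggregations at once, returned as a DataFrame.
summary = energy.agg(['mean', 'max'])

print(energy_by_admin)
print(summary)
```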


**Exercise**: Calculate the mean and std of age of male and female students separately

## Loading and storing data

While pandas supports many file formats, we will mostly deal with CSV.

We can store a DataFrame as a CSV file with `to_csv`:

In [37]:
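# index=False keeps the row index out of the file, so no extra "Unnamed: 0" column appears on reload
df.to_csv('students_v2.csv', index=False)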

And load it again using `pd.read_csv`:

In [39]:
df = pd.read_csv('students_v2.csv')

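
By default, `to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read again. The sketch below demonstrates this round trip on made-up data, writing to a temporary directory so that no files are left behind. The file name `people.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data, only for illustration.
people = pd.DataFrame({'Name': ['Ana', 'Bo'], 'Age': [25, 40]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'people.csv')

    # Default: the row index is written as an unnamed first column.
    people.to_csv(path)
    with_index = pd.read_csv(path)

    # index=False leaves the index out of the file.
    people.to_csv(path, index=False)
    without_index = pd.read_csv(path)

print(list(with_index.columns))
print(list(without_index.columns))
```

A `FileNotFoundError` from `read_csv` usually means the path does not point to an existing file, for example because the notebook's working directory is not the one the file was saved in.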

## Visualization