# Data frames with Pandas

## Programming and Data Management (EDI 3400)

### *Vegard H. Larsen (Department of Data Science and Analytics)*

# 1. What is Pandas?

Pandas is a widely used Python library for data manipulation and analysis. It provides intuitive structures for organizing data, together with tools to explore, clean, and analyze it. Central to Pandas are its two primary data structures: the `Series`, which holds one-dimensional data, and the `DataFrame`, which holds two-dimensional data (akin to a database table or an Excel sheet). With these structures you can read data from various sources, manipulate rows and columns, handle missing values, and merge or aggregate data from multiple tables. Many tasks traditionally done in spreadsheet software can be performed more efficiently and reproducibly in code, which makes Pandas a bridge between basic Python programming and data analysis.

## Introduction to Pandas

- A library that makes working with multidimensional structured and tabular data fast and easy
- The name is derived from *panel data* and *Python data analysis* 
- Built-in support for working with time series data
- Provides Excel-like functionality to Python
- Makes data cleaning and analysis fast and convenient in Python

## Importing Pandas 

- As with NumPy, there is a common import convention for Pandas

In [None]:
import pandas as pd

In [None]:
pd.__version__

# 2. Pandas data structures 

## Data structures in `Pandas`

1. Series - One dimensional array of data 
2. DataFrame - Can consist of many Series as columns in the DataFrame
3. Panel - Could hold many DataFrames; it has been removed from recent Pandas versions (will not be covered in this course)
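The relationship between the first two structures can be sketched as follows. The column names `x` and `y` and their values are made up for illustration only.

```python
import pandas as pd

# A small DataFrame with two columns (illustrative values)
df = pd.DataFrame({'x': [1, 2, 3], 'y': [0.5, 1.5, 2.5]})

# Selecting a single column gives back a Series
col = df['x']
is_series = isinstance(col, pd.Series)

# The column shares its row labels (the index) with the DataFrame
same_index = col.index.equals(df.index)

print(is_series, same_index)
```

Each column of a DataFrame is a Series, and all columns share the same index.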

## Pandas Series-object

- A one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.)

In [None]:
import pandas as pd

series1 = pd.Series([100, 200, 300, 400])

In [None]:
series1

In [None]:
type(series1)

## Working with the Series

In [None]:
# Getting out values

series1[0]

In [None]:
# Assigning values

series1[1] = 1234

In [None]:
series1

In [None]:
# Slicing

series1[1:3]
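When a Series has labels that are not the default integers, it helps to state explicitly whether you select by label or by position. The sketch below uses a hypothetical Series called `prices` to contrast `.loc` (labels) with `.iloc` (positions).

```python
import pandas as pd

# Hypothetical Series with string labels
prices = pd.Series([100, 200, 300, 400], index=['a', 'b', 'c', 'd'])

by_label = prices.loc['b']       # select by label
by_position = prices.iloc[1]     # select by integer position

# Label slices include the end label; position slices exclude the stop
label_slice = prices.loc['b':'d']
pos_slice = prices.iloc[1:3]

print(label_slice)
print(pos_slice)
```

Note that the label-based slice includes its end point, while the position-based slice follows the usual Python convention and excludes it.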

## Numerical operations 

In [None]:
# We can initialize a Series from a dictionary

s1 = pd.Series({'a': 10, 'b': 20, 'c': 30})
s2 = pd.Series({'a': 0.1, 'b': 0.2, 'c':0.3, 'd':0.4, 'e': 0.5})

In [None]:
#s2

In [None]:
s1 * 2

In [None]:
s3 = s1 + s2
s3
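Arithmetic between two Series is aligned on the index labels, so labels found in only one of them produce `NaN`. As a minimal sketch, the `add` method with `fill_value` treats a missing label as a given value instead:

```python
import pandas as pd

s1 = pd.Series({'a': 10, 'b': 20, 'c': 30})
s2 = pd.Series({'a': 0.1, 'b': 0.2, 'c': 0.3, 'd': 0.4, 'e': 0.5})

# Labels 'd' and 'e' exist only in s2, so the plain sum is NaN there
plain_sum = s1 + s2

# With fill_value=0 the missing entries in s1 are treated as 0
filled_sum = s1.add(s2, fill_value=0)

print(plain_sum)
print(filled_sum)
```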

## Series methods

In [None]:
# Drop NaNs: dropna() returns a new Series and leaves s3 unchanged

s3_clean = s3.dropna()
s3_clean

In [None]:
# Drop particular indexes

s3.drop(['a', 'e'])

In [None]:
# We can concatenate data

pd.concat([s1,s2])
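Concatenating two Series that share labels keeps the duplicates in the index. The sketch below shows two common alternatives: renumbering the result with `ignore_index=True`, and placing the Series side by side as columns with `axis=1`. The column names passed to `keys` are chosen here for illustration.

```python
import pandas as pd

s1 = pd.Series({'a': 10, 'b': 20, 'c': 30})
s2 = pd.Series({'a': 0.1, 'b': 0.2, 'c': 0.3, 'd': 0.4, 'e': 0.5})

# Stacking keeps the original labels, so 'a', 'b' and 'c' appear twice
combined = pd.concat([s1, s2])
has_unique_labels = combined.index.is_unique

# Discard the labels and number the rows 0, 1, 2, ...
renumbered = pd.concat([s1, s2], ignore_index=True)

# Place the Series next to each other as columns of a DataFrame
side_by_side = pd.concat([s1, s2], axis=1, keys=['s1', 's2'])

print(side_by_side)
```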

## Pandas DataFrame-object

In [None]:
import pandas as pd

content = [['a', 1, 'apple'], 
           ['b', 2, 'banana'], 
           ['c', 3, 'orange']]

dataframe1 = pd.DataFrame(content,
                          columns=['letter', 'number', 'fruit'],
                          index=['one', 'two', 'three'])

In [None]:
dataframe1

In [None]:
type(dataframe1)

## Working with the DataFrame

In [None]:
column_fruit = dataframe1['fruit']

In [None]:
column_fruit

In [None]:
type(column_fruit)

In [None]:
# Use loc to access a row from the DataFrame

row_1 = dataframe1.loc['two']

In [None]:
row_1

In [None]:
# Use iloc to access a row by integer position

row_0 = dataframe1.iloc[0]

In [None]:
row_0
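Both `loc` and `iloc` also accept a row selector and a column selector together. A minimal sketch, rebuilding the same small DataFrame under the shorter name `df` so the example is self-contained:

```python
import pandas as pd

content = [['a', 1, 'apple'],
           ['b', 2, 'banana'],
           ['c', 3, 'orange']]

df = pd.DataFrame(content,
                  columns=['letter', 'number', 'fruit'],
                  index=['one', 'two', 'three'])

# A single cell: row label and column label
cell_by_label = df.loc['two', 'fruit']

# The same cell by integer positions (second row, third column)
cell_by_position = df.iloc[1, 2]

# Several rows and columns at once, selected by label
subset = df.loc[['one', 'three'], ['letter', 'fruit']]

print(cell_by_label, cell_by_position)
print(subset)
```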

## Viewing data

In [None]:
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4), columns=list("ABCD"))

In [None]:
# Look at the first 3 rows

df.head(3)

In [None]:
# Look at the last 3 rows

df.tail(3)

In [None]:
# We can look at a random sample of rows

df.sample(3)

## Sorting data

In [None]:
# We can sort by the values in a given column

df.sort_values(by='A')
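`sort_values` returns a new, sorted DataFrame and leaves the original untouched. The sketch below uses a small hand-written DataFrame (values chosen for illustration) to show descending order, sorting by several columns, and restoring the original order with `sort_index`.

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 1, 2, 3],
                   'B': [0.5, 0.1, 0.3, 0.9]})

# Largest values of A first
descending = df.sort_values(by='A', ascending=False)

# Sort by A, and break ties in A using B
multi = df.sort_values(by=['A', 'B'])

# Sorting on the index brings back the original row order
restored = descending.sort_index()

print(multi)
```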

## Slicing and selection

In [None]:
# Selecting via [] slices the rows

df[6:8]

In [None]:
# We can also slice the columns

df.iloc[:, 1:3]

In [None]:
# We can also ask for very specific slices

df.iloc[[0, 5, 6, 8], [0, 3]]
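Besides positions, rows can be selected with a condition on the data. A minimal sketch of boolean selection, using a seeded random generator so the values are reproducible:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 4)), columns=list("ABCD"))

# Keep only the rows where column A is positive
positive_a = df[df['A'] > 0]

# Combine conditions with & (and) or | (or); the parentheses are required
both = df[(df['A'] > 0) & (df['B'] < 0)]

# With loc, a row condition can be combined with a column selection
selected = df.loc[df['A'] > 0, ['A', 'D']]

print(selected)
```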

## Statistical methods 

In [None]:
import pandas as pd
import numpy as np

# Set seed for reproducibility
np.random.seed(42)

# Generate scores in 3 subjects for 100 students
data = {
    'Math': np.random.randint(50, 100, 100),
    'English': np.random.randint(50, 100, 100),
    'History': np.random.randint(50, 100, 100)
}

df = pd.DataFrame(data)
print(df.head())  # Print the first 5 rows

In [None]:
# Mean Score of Each Subject:

mean_scores = df.mean()
print("Mean Scores:\n", mean_scores)

In [None]:
# Median Score of Each Subject:

median_scores = df.median()
print("\nMedian Scores:\n", median_scores)

In [None]:
# Standard Deviation of Each Subject:

std_dev = df.std()
print("\nStandard Deviation:\n", std_dev)

In [None]:
# Highest and Lowest Score in Math:

max_math = df['Math'].max()
min_math = df['Math'].min()

print("\nHighest Math Score:", max_math)
print("Lowest Math Score:", min_math)

In [None]:
# Correlation Between Subjects:

correlation = df.corr()
print("\nCorrelation between subjects:\n", correlation)

In [None]:
# Number of Students Scoring Above 90 in English:

above_90_english = df[df['English'] > 90].shape[0]
print("\nNumber of students scoring above 90 in English:", above_90_english)
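Several of the statistics above can be computed in one call. As a sketch on the same simulated scores, `describe` returns a standard summary table, and `agg` applies a chosen list of statistics to every column:

```python
import numpy as np
import pandas as pd

np.random.seed(42)

df = pd.DataFrame({
    'Math': np.random.randint(50, 100, 100),
    'English': np.random.randint(50, 100, 100),
    'History': np.random.randint(50, 100, 100)
})

# Count, mean, std, min, quartiles and max for each column
summary = df.describe()

# A custom selection of statistics
custom = df.agg(['mean', 'median', 'std'])

# Counting rows that satisfy a condition: True counts as 1 when summed
count_above_90 = (df['English'] > 90).sum()

print(summary)
print(custom)
```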

# 3. Time series data

## Pandas has great functionality for working with dates

In [None]:
# We can create an index with dates
# ('M' means month end; pandas 2.2 and later prefer the alias 'ME' and warn on 'M')

dates = pd.date_range(start="2022-09-01", periods=30, freq='M')

In [None]:
dates

In [None]:
#pd.date_range?

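### freq-options (non-exhaustive list)

| Within day | Within month | Lower frequency |
| --- | --- | --- |
| S (secondly)       | D (calendar day) | M (month end)      |
| T (minutely)       | B (business day) | QS (quarter start) |
| H (hourly)         | W (weekly)       | Q (quarter end)    |
| BH (business hour) | SM (semi-month)  | A, Y (year end)    |

Pandas 2.2 deprecated several of these aliases in favor of new spellings, for example `ME` for `M`, `QE` for `Q`, `YE` for `A` and `Y`, `h` for `H`, `min` for `T`, and `s` for `S`.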

In [None]:
dates

In [None]:
# Create a dataframe with random numbers and use the index with dates

df = pd.DataFrame(np.random.randn(30, 4), 
                  columns=list("ABCD"),
                  index=dates)

# A new data frame that is the cumulative sum of the random numbers
df_sum = df.cumsum()

In [None]:
# Pandas has built-in plotting (it uses Matplotlib under the hood)

df_sum.plot()
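A date index also allows selection by date strings and aggregation to a lower frequency. The sketch below is an illustration on made-up daily data (a column of ones named `value`), using the `D` (daily) and `MS` (month start) aliases.

```python
import numpy as np
import pandas as pd

# 61 daily observations: all of September and October 2022
dates = pd.date_range(start="2022-09-01", periods=61, freq="D")
df = pd.DataFrame({'value': np.ones(61)}, index=dates)

# Select all rows from one month with a partial date string
october = df.loc['2022-10']

# Aggregate the daily data to one row per month (labeled by month start)
monthly = df.resample('MS').sum()

print(monthly)
```

Because every daily value is one, the monthly sums simply count the days in each month.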

# 4. Importing and exporting data

In [None]:
# Pandas has many different read functions; this one parses a table copied to the clipboard

data = pd.read_clipboard()

In [None]:
data

In [None]:
ls

## From Excel:

In [None]:
data_excel = pd.read_excel('files/travel_changeFromSameMonth2019.xlsx')
data_excel.head()

In [None]:
# We can get Pandas to read the dates for us 

data_excel = pd.read_excel('files/travel_changeFromSameMonth2019.xlsx',
                     index_col=[0],
                     parse_dates=True)
data_excel.head()

In [None]:
data_excel.index

In [None]:
data_excel.plot()

## From csv (Comma Separated Values):

In [None]:
# Reading a csv-file

data_csv = pd.read_csv('files/travel_changeFromSameMonth2019.csv',
                      index_col=[0],
                      parse_dates=True)
data_csv.index
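To see what `index_col` and `parse_dates` do without needing a file on disk, the sketch below reads CSV text from an in-memory string. The column names and numbers are invented for illustration and are not taken from the course data file.

```python
import io

import pandas as pd

# Made-up CSV content: the first column holds dates
csv_text = (
    "date,x,y\n"
    "2020-01-31,-5.0,-2.5\n"
    "2020-02-29,-12.0,-8.0\n"
    "2020-03-31,-60.0,-45.5\n"
)

# index_col=[0] uses the first column as the index;
# parse_dates=True asks Pandas to convert that index to dates
data = pd.read_csv(io.StringIO(csv_text), index_col=[0], parse_dates=True)

print(data.index)
```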

## Save dataframe as a csv-file

In [None]:
# Let's create some random data 

df1 = pd.DataFrame(np.random.randn(30,5),
                   columns=list('ABCDE'),
                   index=pd.date_range('1990-01-01', periods=30, freq='A'))  # 'A' = year end ('YE' in pandas 2.2+)
df1.tail()

In [None]:
# The .to_csv method saves the dataframe as a csv file

df1.to_csv('random_numbers2.csv')
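A quick way to check an export is to read the file back in. The following sketch writes a small DataFrame to a temporary folder (so no file is left behind) and reloads it; the file name `round_trip.csv` is arbitrary.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# A small DataFrame with a date index
df1 = pd.DataFrame(np.arange(6.0).reshape(3, 2),
                   columns=list('AB'),
                   index=pd.date_range('1990-01-01', periods=3, freq='D'))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'round_trip.csv')

    # Write the data, including the index, to a csv file
    df1.to_csv(path)

    # Read it back, using the first column as a date index
    restored = pd.read_csv(path, index_col=0, parse_dates=True)

print(restored)
```

The index is written as the first column of the file, which is why `index_col=0` is needed when reading it back.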