# Pandas 

**Pandas**, short for **Python Data Analysis Library**, is one of the most widely used Python tools for data munging/wrangling. 

Pandas provides high-level data structures and functions that are designed to make working with structured or tabular data fast, easy and straightforward. Pandas blends the high-performance, array-computing ideas of NumPy with the flexible data manipulation capabilities of spreadsheets and relational databases (such as SQL). 

It provides sophisticated indexing functionality to make it easy to reshape, slice and dice, perform aggregations, and select subsets of data. Since data manipulation, preparation, and cleaning is such an important skill in data analysis, pandas is one of the primary focuses of this course.

_(source: Python for Data Analysis, Wes McKinney)_

More information about the pandas package, including documentation and examples, can be found here: https://pandas.pydata.org/


In [None]:
# we start by importing the pandas package

# the commonly used alias for pandas is pd 

import pandas as pd

## DataFrames

A **DataFrame** is a Python object with rows and columns that looks very similar to a table in Excel or in a traditional SQL database:

<div>
<img src="attachment:dataframe.png" width="400"/>
</div>

Every row in a dataframe represents one observation (or data point) and every column represents one variable. 

If you're working with data, you will use data frames. 

There is more than one way to create a data frame, and we will start by creating one from scratch. 

In [None]:
# creating the data frame shown in the picture

# we start by creating a dictionary with variable names as keys 
# and lists of values for those variables as values 

data = {
    'Name': ['John', 'Luke', 'Mia', 'Annie'], 
    'Age':[39, 17, 65, 27], 
    'Gender':['M', 'M', 'F', 'F'], 
    'Country':['USA', 'UK', 'UK', 'UK'],
    'Children':[1, 0, 4, 2], 
    'Car Brand':['Audi', 'Toyota', 'Audi', 'BMW']
}

# Create the pandas DataFrame 

df = pd.DataFrame(data)

df  # to display the data frame, we just write its name

There are many more ways to create a data frame from scratch. You can read more about it here: https://www.geeksforgeeks.org/different-ways-to-create-pandas-dataframe/
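
As a small sketch of two of those alternatives, a data frame can also be built from a list of dictionaries (one per row) or from a list of rows plus column names. The variable names and values below are illustrative only and reuse part of the example above.

```python
import pandas as pd

# one dictionary per row: the keys become column names
rows = [
    {'Name': 'John', 'Age': 39},
    {'Name': 'Luke', 'Age': 17},
]
df_from_records = pd.DataFrame(rows)

# a list of rows plus an explicit list of column names
df_from_lists = pd.DataFrame([['John', 39], ['Luke', 17]], columns=['Name', 'Age'])

# both approaches describe the same table
print(df_from_records.equals(df_from_lists))
```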

However, your dataset will usually be too big to create a data frame manually. That's why the most common way to create a data frame is to read data from a database or a file. 

We will cover importing data from different files in Pandas part 2.

## Index 

The index of a dataframe is an identifier for each row (ideally a unique one); that's how any data point in the dataframe can be accessed. By default the index is the row number: 0, 1, 2, ...

Often you will have another unique identifier in your dataframe (national identity number, employee ID, transaction number...), and you can set that variable as the index. 
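
Pandas does not require index values to be unique, so it can be worth checking a candidate column before using it as the index. The sketch below uses a made-up `employee_id` column (not part of the course data) to show the check and a lookup by label.

```python
import pandas as pd

people = pd.DataFrame({
    'employee_id': [101, 102, 103],   # hypothetical identifier column
    'Name': ['John', 'Luke', 'Mia'],
})

# check that the candidate column identifies each row uniquely
print(people['employee_id'].is_unique)

# use it as the index, then look a row up by its label
people_by_id = people.set_index('employee_id')
print(people_by_id.loc[102, 'Name'])
```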

In [None]:
# setting variable 'Name' as index using the set_index() function

df = df.set_index('Name')

df

In [None]:
# and we can always reset index back to numbers 0,1,2,3...

df = df.reset_index()

df

### Date Range

In [None]:
pd.date_range(start="2023-01-01", periods=4, freq="D")

In [None]:
# Create a datetime index

idx = pd.date_range(start="2023-01-01", periods=4, freq="D")
df.index = idx
df

In [None]:
# alternatively, we can pass the index when creating the data frame
df = pd.DataFrame(data=data, index=idx)
df

## Investigating data frames 

Pandas has many functions we can use to investigate our dataframes. 

We can access both basic information about dataframes and summary statistics.  

In [None]:
# head() shows the first 5 rows of the data frame by default
# tip: you can specify the number of rows you'd like to see
# by writing it in the parentheses:

df.head(2)

In [None]:
# we can also view the last rows of the data frame using the tail() function
# (5 rows by default); again, you can specify the number of rows in the parentheses

df.tail(2)

In [None]:
# number of rows and columns in dataframe

# we use the shape attribute, which returns a tuple (number of rows, number of columns)

df.shape

In [None]:
# we can also unpack it into num_rows, num_cols

num_rows, num_cols = df.shape

print('Number of rows is:', num_rows)

print('Number of columns is:', num_cols)

In [None]:
# general information about columns

df.info()

In [None]:
# summary statistics of numeric columns

df.describe()

In [None]:
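# data type of each column
df.dtypes

In [None]:
# the whole table is a DataFrame
type(df)

In [None]:
# a single column is a Series
type(df.Name)

In [None]:
# to_frame() converts a Series into a one-column DataFrame
type(df.Name.to_frame())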

## Accessing columns

Columns in a pandas data frame can be accessed as `df.column`, `df['column']`, `df.loc[:, 'column']`, or by position with the iloc selector, `df.iloc[:, <column_number>]`.
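
To make the equivalence concrete, here is a minimal sketch on a small made-up data frame (the name `people` is illustrative) showing that the four forms return the same column, and how single and double square brackets differ.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['John', 'Luke', 'Mia'],
    'Age': [39, 17, 65],
    'Gender': ['M', 'M', 'F'],
})

by_attribute = people.Gender          # attribute access
by_brackets = people['Gender']        # square brackets
by_loc = people.loc[:, 'Gender']      # label-based selector
by_iloc = people.iloc[:, 2]           # position-based selector (third column)

# all four return the same Series
print(by_attribute.equals(by_brackets), by_brackets.equals(by_loc), by_loc.equals(by_iloc))

# single brackets give a Series, double brackets give a DataFrame
print(type(people['Gender']).__name__, type(people[['Gender']]).__name__)
```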

In [None]:
df.columns  # names of columns

In [None]:
# there are multiple ways to access the Gender column 

# we can write:

df.Gender

# this will not work for columns with spaces in their names, e.g. df.Car Brand

In [None]:
# this way will work regardless of column name 

df['Car Brand']

In [None]:
df.loc[:, "Car Brand"]

In [None]:
# we can also use the iloc indexer - in this case, we don't use the column name, we use the column position 

# iloc takes two arguments - the first is the row position, and the second is the column position

# in this case, we want all rows, so we write : as the first argument 

# we want the Gender column, which is the third column in df, 
# but since indexing starts at 0, the position of Gender is 2

df.iloc[:, 2]

In [None]:
# with iloc, we can also access specific rows 

# for example, if we're interested in a value in 3rd row and 5th column:

df.iloc[2, 4]

In [None]:
# let's check our df

df

In [None]:
# we can also access multiple columns - you need to write double square brackets 

df[['Name', 'Country', 'Car Brand']]

In [None]:
# the same selection with loc: all rows (:) and a list of column names 

df.loc[:, ['Name', 'Country', 'Car Brand']]

In [None]:
# by selecting columns from the original dataframe we can create new data frame 

new_df = df[['Name', 'Country', 'Car Brand']]

new_df

In [None]:
# for specific column, we can calculate mean, standard deviation, sum, min, max,
# entire summary statistics... 

df['Age'].mean()

In [None]:
print(df['Age'].std()) 
print()
print(df['Children'].sum())
print()
print(df['Children'].max())
print()
print(df['Children'].min())
print()
print(df['Car Brand'].mode())

## Task

a) Create a data frame that looks like this:

<div>
<img src="../basic_df.png" width="150"/>
</div>


b) Do the following: 

1. Find the value in the fourth row and second column 
2. Find which values we have on 2023-01-03
3. Store the value for 2023-01-03, column a, in a variable


In [None]:
task_idx = pd.date_range(start="2023-01-01", freq="D", periods=5)
task_data = dict(a=[1,2,3,4,5], b=[10,20,30,40,50])
task_df = pd.DataFrame(task_data, index=task_idx)

task_df

In [None]:
print("Task 1:")
print(task_df.iloc[3, 1])  # fourth row, second column
print("Task 2:")
print(task_df.loc["2023-01-03"])  # all values for that date
# Task 3: store a single value in a variable
value_of_a = task_df.loc["2023-01-03", "a"]

## Merging data frames 

Often your data comes from multiple sources and is saved in more than one data frame. That means you need to merge these dataframes into one. 

Merging dataframes in Python is very similar to joins in SQL. 


<div>
<img src="attachment:joins.png" width="350"/>
</div>

Image source: https://medium.com/swlh/merging-dataframes-with-pandas-pd-merge-7764c7e2d46d

There are two functions in pandas that we can use for joining dataframes: merge() and join(). 

Throughout this course we will use the **merge()** function. 
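
The four join types can be compared on two tiny made-up dataframes. The names `orders`, `products` and `productId` are illustrative and not part of the course data; this is only a sketch of how the `how` argument changes which rows are kept.

```python
import pandas as pd

# productId 4 exists only in orders, productId 3 only in products
orders = pd.DataFrame({'productId': [1, 2, 4], 'quantity': [5, 1, 3]})
products = pd.DataFrame({'productId': [1, 2, 3], 'productName': ['pen', 'book', 'lamp']})

shapes = {}
for how in ['inner', 'left', 'right', 'outer']:
    merged = pd.merge(orders, products, how=how, on='productId')
    shapes[how] = merged.shape
    print(how, merged.shape)
```

An inner join keeps only keys found in both dataframes, left and right joins keep every key from one side, and an outer join keeps every key from either side, filling the gaps with missing values.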

In [None]:
# let's load the two datasets that we will merge

transaksjoner = pd.read_csv("../data/logs.csv")
print(transaksjoner.shape)
transaksjoner.head(2)

In [None]:
produkter = pd.read_csv("../data/stock.csv")
print(produkter.shape)
produkter.head(2)

In [None]:
produkter.info()

## Deleting variables

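Reading a file can leave us with columns we do not need, such as the `Unnamed: 0` column in the dataframes above, and merging dataframes can lead to duplicated columns. There are also many other reasons for wanting to delete variables from a dataframe, so it is important that we know how to do it.  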
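
One common source of an unwanted column is a CSV file that was saved together with its index. The sketch below reproduces that with a small made-up dataframe and an in-memory file, then removes the column with `drop()`. Whether the course files were produced this way is an assumption.

```python
import io
import pandas as pd

original = pd.DataFrame({'stockId': [1, 2], 'volume': [10, 20]})

# to_csv() writes the index as an unnamed first column by default
csv_text = original.to_csv()

# reading the text back turns that unnamed column into "Unnamed: 0"
reloaded = pd.read_csv(io.StringIO(csv_text))
print(list(reloaded.columns))

# drop() returns a new dataframe; reloaded itself is left unchanged
cleaned = reloaded.drop(columns=['Unnamed: 0'])
print(list(cleaned.columns))
```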

In [None]:
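# drop() returns a new dataframe, so this only previews the result
transaksjoner.drop(columns=["Unnamed: 0"]).head(2)

In [None]:
# to keep the change, we assign the result back to the variable
transaksjoner = transaksjoner.drop(columns=["Unnamed: 0"])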

In [None]:
produkter = produkter.drop(columns=["Unnamed: 0"])

In [None]:
# for joining dataframes in pandas we use the merge() function

# we need to specify which type of join we want (inner, outer, left or right)

# and which variable(s) should be used as the joining key 

pd.merge(transaksjoner, produkter, how='inner', on='stockId')

## Task 1

We really only want to bring along the name and volume from the product dataframe.

- Create a dataframe that contains only what we need in order to merge product name and volume with the transactions dataframe

In [None]:
# build a dataframe with only the columns we need (stockId is the merge key)

cols = ["stockName", "volume", "stockId"]
produktnavn = produkter[cols]
print(produktnavn.shape)
produktnavn.head(2)

In [None]:
merged_df = pd.merge(transaksjoner, produktnavn, how='inner', on=["stockId"])
print(merged_df.shape)
merged_df.head(2)

## Renaming variables 

Often you will want to rename your variables. The `rename()` function takes a dictionary that maps current column names to new ones. 
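
A minimal sketch of renaming on a made-up dataframe (the name `logs` and its values are illustrative): only the columns listed in the dictionary are renamed, and the original dataframe is left unchanged unless the result is assigned back.

```python
import pandas as pd

logs = pd.DataFrame({'timestamp': ['2023-01-01', '2023-01-02'], 'stockId': [1, 2]})

# keys are the current names, values are the new names;
# columns that are not mentioned keep their names
renamed = logs.rename(columns={'timestamp': 'logTimestamp'})

print(list(renamed.columns))
print(list(logs.columns))  # the original dataframe is unchanged
```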

In [None]:
merged_df = merged_df.rename(columns={"timestamp": "logTimestamp"})
merged_df.tail(2)

## Task 2

- Load the file named groups.csv from the data folder and merge it with the existing dataframe merged_df
- The column we are after is the group name

In [None]:
grupper = pd.read_csv("../data/groups.csv")
grupper = grupper[["groupId", "groupName"]]
print(grupper.shape)
grupper.head(2)

In [None]:
# merge it into merged_df
merged_df = pd.merge(merged_df, grupper, how='inner', on=["groupId"])
print(merged_df.shape)
merged_df.head(2)

## Deleting variables

We have noticed that merging dataframes can lead to duplicated columns. There are also many other reasons for wanting to delete variables from the dataframe, so it is important we know how to do it.  
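
As a sketch of how such duplicates arise: when both dataframes contain a column with the same name that is not the merge key, pandas keeps both and adds the suffixes `_x` and `_y`. The dataframes below are made up for illustration and are not the course files.

```python
import pandas as pd

logs = pd.DataFrame({'stockId': [1, 2], 'timestamp': ['2023-01-01', '2023-01-02']})
stock = pd.DataFrame({'stockId': [1, 2], 'timestamp': ['2022-12-01', '2022-12-05'],
                      'stockName': ['pen', 'book']})

merged = pd.merge(logs, stock, how='inner', on='stockId')
print(list(merged.columns))  # timestamp appears twice, with suffixes

# keep the log timestamp under a clear name and drop the other one
tidy = merged.rename(columns={'timestamp_x': 'logTimestamp'}).drop(columns=['timestamp_y'])
print(list(tidy.columns))
```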

In [None]:
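# errors="ignore" skips any listed column that is not present,
# e.g. "timestamp", which was renamed to "logTimestamp" above
merged_df.drop(columns=["timestamp", "logoPath"], errors="ignore").head(2)

In [None]:
merged_df = merged_df.drop(columns=["timestamp", "logoPath"], errors="ignore")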

## Saving DataFrame

In [None]:
# index=False leaves the row index out of the file
merged_df.to_csv("../data/utfyllende_transaksjon.csv", index=False)

## Concatenating dataframes 

Sometimes we may also have the same kind of data spread across several dataframes and we would like to concatenate them. 

**merge()** is used to combine two dataframes on the basis of values of common columns (or the index), and **concat()** is used to append one (or more) dataframes one below the other.
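
The difference can be sketched with a few made-up rows: `concat()` adds rows while keeping the same columns, whereas `merge()` adds columns by matching rows on a key. The dataframe names below are illustrative.

```python
import pandas as pd

jan = pd.DataFrame({'Employee ID': [1, 2], 'Name': ['Kelly', 'Annie']})
feb = pd.DataFrame({'Employee ID': [3], 'Name': ['John']})
roles = pd.DataFrame({'Employee ID': [1, 2, 3],
                      'Role': ['Director of HR', 'Data Scientist', 'Analyst']})

# concat stacks rows: same columns, more rows
stacked = pd.concat([jan, feb], ignore_index=True)
print(stacked.shape)

# merge matches rows on a key: same rows, more columns
with_roles = pd.merge(stacked, roles, how='left', on='Employee ID')
print(with_roles.shape)
```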

In [None]:
# creating three dataframes, first one with employees starting in January,
# second one with employees starting in February and third one with employees starting in March 

employees_jan = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR', 'Employee ID': 1876},
                              {'Name': 'Annie', 'Role': 'Data Scientist', 'Employee ID': 3234},
                              {'Name': 'James', 'Role': 'Developer', 'Employee ID': 6743}])  

employees_feb = pd.DataFrame([{'Name': 'John', 'Role': 'Analyst', 'Employee ID': 5432},
                              {'Name': 'Rebecca', 'Role': 'Product Manager', 'Employee ID': 9807},
                              {'Name': 'Peggy', 'Role': 'Data Scientist', 'Employee ID': 1253}])

employees_mar = pd.DataFrame([{'Name': 'Melanie', 'Role': 'IT manager', 'Employee ID': 4278},
                              {'Name': 'Michael', 'Role': 'Receptionist', 'Employee ID': 7549},
                              {'Name': 'Steven', 'Role': 'Developer', 'Employee ID': 2641}])

In [None]:
employees_jan

In [None]:
# concatenating dataframes

future_employees = pd.concat([employees_jan, employees_feb, employees_mar])

future_employees

In [None]:
# we need to fix the index 

# we can do it with the reset_index() function 

# or we can add ignore_index=True as an argument to the concat() function 

future_employees = pd.concat([employees_jan, employees_feb, employees_mar], ignore_index=True)

future_employees

### Sorting dataframes 

Dataframes can be sorted by values in columns.
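
As a sketch of one option beyond the basics, `sort_values()` also accepts a list for `ascending`, so each column can have its own sort direction. The dataframe `staff` below is made up for illustration.

```python
import pandas as pd

staff = pd.DataFrame({
    'Name': ['Annie', 'Peggy', 'James', 'Steven'],
    'Role': ['Data Scientist', 'Data Scientist', 'Developer', 'Developer'],
})

# Role ascending (A to Z), and within each role Name descending (Z to A)
ordered = staff.sort_values(by=['Role', 'Name'], ascending=[True, False])
print(ordered['Name'].tolist())

# sort_values() keeps the old index labels;
# reset_index(drop=True) renumbers the rows 0, 1, 2, ...
ordered = ordered.reset_index(drop=True)
print(ordered.index.tolist())
```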

In [None]:
# we use the sort_values() function, which sorts in ascending order by default

future_employees = future_employees.sort_values(by='Name')

future_employees

In [None]:
# we need to specify ascending=False if we want to sort in descending order 

future_employees = future_employees.sort_values(by='Name', ascending=False)

future_employees

In [None]:
# if we want to sort by two variables, we pass a list of column names: 

future_employees = future_employees.sort_values(by=['Role', 'Name'])

future_employees