# Introduction

[Pandas](https://pandas.pydata.org/ "Pandas") is a Python library for analyzing, cleaning, processing and manipulating data.

The name Pandas is derived from "panel data" and is also a play on "Python data analysis".

<!-- Data Structure outline -->

Pandas has three data structures built on [NumPy](https://numpy.org/ "NumPy") arrays:

* Series
* DataFrame
* Panel (deprecated in pandas 0.20 and removed in 0.25; not covered in this course)


## Series

A Series is a one-dimensional labeled array of homogeneous data, like a single column of a table.


We can create an empty series:

In [2]:
# import the pandas library & alias as pd
import pandas as pd

# create an empty series (an explicit dtype avoids a warning in some pandas versions)
s = pd.Series(dtype="float64")


We can initialize a Series from a list:

In [3]:
import pandas as pd

# create a list of grades
grades = [12, 17, 20, 13, 19, 14, 16]

# create a series based on the list of grades
s = pd.Series(grades)

# print the series
print(s)

0    12
1    17
2    20
3    13
4    19
5    14
6    16
dtype: int64
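To make the "homogeneous data" point concrete, here is a minimal sketch that inspects the parts of a Series. The variable names `values` and `mixed` are illustrative and not part of the course material.

```python
import pandas as pd

grades = [12, 17, 20, 13, 19, 14, 16]
s = pd.Series(grades)

# a Series has an index (the labels) and one dtype shared by all values
print(s.index)
print(s.dtype)

# the values can be exported as a NumPy array
values = s.to_numpy()
print(type(values))

# mixing integers with a float yields a single common dtype
mixed = pd.Series([12, 17.5, 20])
print(mixed.dtype)  # float64
```

Because all values share one dtype, the integers in `mixed` are stored as floats.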


## access a series

We can access the items of a Series (i) by index (position) or (ii) by label.

### (i) access by index

We can access an item of a Series by referring to its position.

Remember that the first element is at position 0.

In [4]:
# access the 1st element
first = s[0]

# print the first element
print(first)


12


In [5]:
# access an arbitrary element (here the 4th, at position 3)
fourth = s[3]

# print this element
print(fourth)

13


We can access the first few elements:

In [6]:
# access the first three elements
first_few = s[:3]

# print the first three elements 
print(first_few)

0    12
1    17
2    20
dtype: int64


We can access the last few elements:

In [7]:
# access the last three elements
last_few = s[-3:]

# print the last three elements
print(last_few)

4    19
5    14
6    16
dtype: int64
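Position-based access can also be written explicitly with the `iloc` attribute. The sketch below repeats the examples above with `iloc`; it assumes the same list of grades. Unlike plain `s[...]`, `iloc` always counts positions, so it keeps working when the Series has custom labels.

```python
import pandas as pd

s = pd.Series([12, 17, 20, 13, 19, 14, 16])

# single elements by position
first = s.iloc[0]
last = s.iloc[-1]  # negative positions count from the end

# slices by position
first_few = s.iloc[:3]
last_few = s.iloc[-3:]

print(first, last)
print(first_few)
print(last_few)
```

Note that with the default integer index, `s[-1]` is looked up as a label rather than a position, which is one reason to prefer `iloc` for positional access.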


### (ii) access by label

We can create a Series with custom labels and access an item by referring to its label.

We can access one item by referring to its label:

In [8]:
# create a series with labels 
my_var = pd.Series(grades, index=["a", "b", "c", "d", "e", "f", "g"])

# access a series by label 
item = my_var["a"]

# print the item that was accessed by label
print(item)

12


We can access multiple items by passing a list of index labels:

In [9]:
# access multiple items 
items = my_var[["a","c","f"]]

# print multiple items
print(items)

a    12
c    20
f    14
dtype: int64
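Label-based access can also be written explicitly with the `loc` attribute. The following sketch assumes the same labeled Series as above; `label_slice` is an illustrative name. One detail worth knowing is that a label slice includes its end label, unlike a positional slice.

```python
import pandas as pd

grades = [12, 17, 20, 13, 19, 14, 16]
my_var = pd.Series(grades, index=["a", "b", "c", "d", "e", "f", "g"])

# one item by label
item = my_var.loc["a"]

# several items by a list of labels
items = my_var.loc[["a", "c", "f"]]

# a slice of labels: both "a" and "c" are included
label_slice = my_var.loc["a":"c"]

print(item)
print(items)
print(label_slice)
```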


## DataFrame

A DataFrame is a 2D data structure, like a table with rows and columns.

A DataFrame is a container of Series, while a Panel was a container of DataFrames.

### create a DataFrame

We can create (i) an empty DataFrame, (ii) a DataFrame from lists or (iii) from series.


#### (i) create an empty DataFrame

We can create an empty DataFrame:

In [10]:
# create an empty DataFrame
df = pd.DataFrame()

#### (ii) create DataFrames from lists

We can create a DataFrame from an existing or new list:

In [11]:
# create a DataFrame from an existing list
df = pd.DataFrame(grades)

print(df)

    0
0  12
1  17
2  20
3  13
4  19
5  14
6  16


In [12]:
# create a list
students = ["Bob", "Alice", "Carol", "David", "Frank", "Grace", "Oscar"]

# load the grades into a DataFrame, using the student names as the row index
df = pd.DataFrame(grades, index=students)

print(df)

        0
Bob    12
Alice  17
Carol  20
David  13
Frank  19
Grace  14
Oscar  16
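In the outputs above the single column is named `0` because no column name was given. As a small sketch, the `columns` argument can name it; the column name `"grade"` is an assumption chosen for illustration.

```python
import pandas as pd

grades = [12, 17, 20, 13, 19, 14, 16]
students = ["Bob", "Alice", "Carol", "David", "Frank", "Grace", "Oscar"]

# name the row index and the single column explicitly
df = pd.DataFrame(grades, index=students, columns=["grade"])

print(df)

# look up one cell by row label and column name
print(df.loc["Carol", "grade"])  # 20
```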


#### (iii) create DataFrames from Series

We can create a DataFrame from a dictionary; each key becomes a column name, and each column is stored as a Series:

In [13]:
# create a dictionary that maps column names to column values
data = {
  "students": ["Bob", "Alice", "Carol", "David", "Frank", "Grace", "Oscar"],
  "grades": [12, 17, 20, 12, 19, 14, 16],
}

# load data into a DataFrame
df = pd.DataFrame(data)

print(df)

  students  grades
0      Bob      12
1    Alice      17
2    Carol      20
3    David      12
4    Frank      19
5    Grace      14
6    Oscar      16
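The same DataFrame can be built from actual Series objects, and selecting one column gives a Series back. This is a minimal sketch using the same data; the names `students`, `grades` and `column` are illustrative.

```python
import pandas as pd

# two Series that share the default integer index
students = pd.Series(["Bob", "Alice", "Carol", "David", "Frank", "Grace", "Oscar"])
grades = pd.Series([12, 17, 20, 12, 19, 14, 16])

# combine the Series into a DataFrame, one column per Series
df = pd.DataFrame({"students": students, "grades": grades})
print(df)

# selecting a single column returns a Series again
column = df["grades"]
print(type(column))
```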


### access a DataFrame

#### (i) access one or more rows

We can access a specific row:

In [14]:
# access the 1st row by referring to the row index
first = df.loc[0]
print(first)

print()

# access the 4th row by referring to the row index
fourth = df.loc[3]
print(fourth)

students    Bob
grades       12
Name: 0, dtype: object

students    David
grades         12
Name: 3, dtype: object


This returns a Series.

We can access more rows by using a list of indexes:

In [15]:
# access the 5th and 6th row by referring to the row index
indexes_list = [4,5]
rows = df.loc[indexes_list]
print(rows)

print()

# directly define the list of indexes in loc
rows = df.loc[[4,5]]
print(rows)

  students  grades
4    Frank      19
5    Grace      14

  students  grades
4    Frank      19
5    Grace      14


This returns a DataFrame.

We can select the rows whose grade is higher than a certain value with a boolean condition, either through `loc` or by indexing the DataFrame directly:

In [16]:
print(df.loc[df["grades"] > 15])

  students  grades
1    Alice      17
2    Carol      20
4    Frank      19
6    Oscar      16


In [17]:
print(df[df["grades"] > 15 ]) 

  students  grades
1    Alice      17
2    Carol      20
4    Frank      19
6    Oscar      16
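Conditions like the one above can be combined. The sketch below assumes the same data and selects grades strictly between 12 and 19; `mask` and `selected` are illustrative names. Each condition needs its own parentheses, and `&` (and) or `|` (or) is used instead of the Python keywords.

```python
import pandas as pd

data = {
    "students": ["Bob", "Alice", "Carol", "David", "Frank", "Grace", "Oscar"],
    "grades": [12, 17, 20, 12, 19, 14, 16],
}
df = pd.DataFrame(data)

# a boolean Series with one True/False value per row
mask = (df["grades"] > 12) & (df["grades"] < 19)

# keep only the rows where the mask is True
selected = df[mask]
print(selected)
```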


#### (ii) access a named index

We can give the rows custom index labels with the `index` argument:

In [18]:
# index list 
index = ["student1", "student2", "student3", "student4", "student5", "student6", "student7"]

# add named index labels to a DataFrame
df = pd.DataFrame(data, index=index)

print(df)

         students  grades
student1      Bob      12
student2    Alice      17
student3    Carol      20
student4    David      12
student5    Frank      19
student6    Grace      14
student7    Oscar      16


We can pass a named index label to the `loc` attribute to return the matching row(s):

In [19]:
# refer to a named index
row = df.loc["student4"]

print(row)

students    David
grades         12
Name: student4, dtype: object
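With a named index, `loc` also accepts a list of labels or a row label together with a column name, and `iloc` still works by position. A short sketch, assuming the same data and index labels as above:

```python
import pandas as pd

data = {
    "students": ["Bob", "Alice", "Carol", "David", "Frank", "Grace", "Oscar"],
    "grades": [12, 17, 20, 12, 19, 14, 16],
}
index = ["student1", "student2", "student3", "student4", "student5", "student6", "student7"]
df = pd.DataFrame(data, index=index)

# several rows by label: returns a DataFrame
rows = df.loc[["student1", "student3"]]
print(rows)

# one cell by row label and column name
grade = df.loc["student4", "grades"]
print(grade)  # 12

# positional access is still available through iloc
first_row = df.iloc[0]
print(first_row)
```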


### group data in a DataFrame

In [20]:
df_grouped = df.groupby("grades")

Retrieve the first row of each group:

In [21]:
df_grouped.first()

       students
grades
12          Bob
14        Grace
16        Oscar
17        Alice
19        Frank
20        Carol
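Note that `first()` returns the first row of every group. To look at one whole group, or at all groups in turn, something like the following sketch can be used; it assumes the same data as above, and `group_12` is an illustrative name.

```python
import pandas as pd

data = {
    "students": ["Bob", "Alice", "Carol", "David", "Frank", "Grace", "Oscar"],
    "grades": [12, 17, 20, 12, 19, 14, 16],
}
df = pd.DataFrame(data)
df_grouped = df.groupby("grades")

# retrieve all rows of the group with grade 12
group_12 = df_grouped.get_group(12)
print(group_12)

# iterate over the groups: each step gives the key and a DataFrame
for grade, group in df_grouped:
    print(grade, list(group["students"]))
```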


Count the number of grades per student:

In [22]:
df.groupby(["students"])["grades"].count()

students
Alice    1
Bob      1
Carol    1
David    1
Frank    1
Grace    1
Oscar    1
Name: grades, dtype: int64

Count the number of students per grade:

In [23]:
df.groupby(["grades"])["students"].count()

grades
12    2
14    1
16    1
17    1
19    1
20    1
Name: students, dtype: int64

Counting the `grades` column itself gives the same numbers, because no values are missing:

In [24]:
df.groupby(["grades"])["grades"].count()

grades
12    2
14    1
16    1
17    1
19    1
20    1
Name: grades, dtype: int64
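The counts above can also be obtained with `size()`, and a group can be collapsed into a list instead of a number. This is only a sketch under the same data; `counts` and `names` are illustrative names.

```python
import pandas as pd

data = {
    "students": ["Bob", "Alice", "Carol", "David", "Frank", "Grace", "Oscar"],
    "grades": [12, 17, 20, 12, 19, 14, 16],
}
df = pd.DataFrame(data)

# number of rows in each group
counts = df.groupby("grades").size()
print(counts)

# collect the student names that share each grade
names = df.groupby("grades")["students"].apply(list)
print(names)
```

The difference between the two counting methods is that `count()` skips missing values in the selected column, while `size()` counts every row in the group.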

# DataFrames & Files

We can load data into a DataFrame from a file:

In [25]:
# load data from a CSV file into a DataFrame
df_csv = pd.read_csv("grades.csv")

print(df_csv)

This raises a `FileNotFoundError` if `grades.csv` does not exist in the working directory.

In [None]:
# load data from a JSON file into a DataFrame
df_json = pd.read_json("grades.json")

print(df_json)
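To try both readers without any files on disk, the text can be wrapped in `io.StringIO`. The contents below are hypothetical; the real layout of `grades.csv` and `grades.json` is not shown in this course, so the column names and values are assumptions.

```python
import io

import pandas as pd

# hypothetical CSV content: a header line followed by one row per student
csv_text = "students,grades\nBob,12\nAlice,17\nCarol,20\n"
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv)

# hypothetical JSON content: a list of records
json_text = '[{"students": "Bob", "grades": 12}, {"students": "Alice", "grades": 17}]'
df_json = pd.read_json(io.StringIO(json_text))
print(df_json)
```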

If a DataFrame is large, `print()` shows only the first 5 rows and the last 5 rows.

To print the entire DataFrame, we can use `to_string()`:

In [None]:
# print a DataFrame with to_string()

print(df_csv.to_string())
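The difference can be checked on a generated DataFrame. The sketch below uses a made-up 100-row DataFrame and pins the display options to their usual defaults, so the result does not depend on local settings.

```python
import pandas as pd

# a made-up DataFrame with 100 rows
df = pd.DataFrame({"value": range(100)})

# the default text representation is truncated for long DataFrames
with pd.option_context("display.max_rows", 60, "display.min_rows", 10):
    short = repr(df)

# to_string() always renders every row
full = df.to_string()

print(len(short.splitlines()))
print(len(full.splitlines()))  # 101: one header line plus 100 rows
```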

#### head

The `head()` method returns the column headers and a specified number of rows, starting from the top.

In [None]:
# print the first 3 rows of the CSV DataFrame
print(df_csv.head(3))

print()

# print the first 4 rows of the JSON DataFrame
print(df_json.head(4))

#### tail

The `tail()` method returns the column headers and a specified number of rows, starting from the end.

In [None]:
# print the last 3 rows of the CSV DataFrame
print(df_csv.tail(3))

print()

# print the last 4 rows of the JSON DataFrame
print(df_json.tail(4))
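Both methods can be tried without a file. The sketch below uses a made-up 20-row DataFrame and also shows that `head()` and `tail()` return 5 rows when no number is given.

```python
import pandas as pd

# a made-up DataFrame with 20 rows
df = pd.DataFrame({"value": range(20)})

# without an argument, head() returns the first 5 rows
top = df.head()
print(top)

# the last 3 rows
bottom = df.tail(3)
print(bottom)
```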

#### info

The `info()` method prints a summary of the data set: the number of rows, the column names, the number of non-null values and the data type of each column.

In [None]:
# print more information about the CSV DataFrame
# (info() prints its report itself and returns None, so no print() is needed)
df_csv.info()

print()

# print more information about the JSON DataFrame
df_json.info()
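Parts of this summary are also available as separate attributes and methods. A minimal sketch, using the students and grades data from earlier instead of a file:

```python
import pandas as pd

data = {
    "students": ["Bob", "Alice", "Carol", "David", "Frank", "Grace", "Oscar"],
    "grades": [12, 17, 20, 12, 19, 14, 16],
}
df = pd.DataFrame(data)

# number of rows and columns
print(df.shape)  # (7, 2)

# data type of each column
print(df.dtypes)

# basic statistics for the numeric columns
stats = df.describe()
print(stats)
```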