### Before we start:
* Final assignment will be handed out in 2 weeks 😊

# 4.1 Data Entry with Pandas

A few words about data:

"Information collected through observation"

* Qualitative
* Quantitative
* Unstructured <> semi-structured <> structured


# Pandas

Python is well suited to basic data manipulation tasks. Pandas, however, offers a higher-level interface designed specifically for data management and analysis. It provides a more efficient, intuitive, and expressive way to work with tabular data, and it is one of the most widely used Python libraries for data science.

https://pandas.pydata.org/docs/user_guide/10min.html
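
To make the comparison concrete, here is a minimal sketch that computes an average age first with plain Python and then with Pandas. The sample records and variable names are made up for illustration only.

```python
import pandas as pd

# Hypothetical sample data, made up for this illustration
people = [
    {"name": "Kari", "age": 50},
    {"name": "Ola", "age": 40},
    {"name": "Per", "age": 45},
]

# Plain Python: collect the values, then compute the mean by hand
ages = [person["age"] for person in people]
mean_age_python = sum(ages) / len(ages)

# Pandas: build the table once, then ask the column for its mean
people_dataframe = pd.DataFrame(people)
mean_age_pandas = people_dataframe["age"].mean()

print(mean_age_python, mean_age_pandas)
```

Both approaches give the same number. The Pandas version pays off as the questions get more involved (filtering, grouping, reading files).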

In [None]:
# Pandas is included with Anaconda (if installed as shown in class)
# If not, install it with "pip install pandas"
!pip install pandas

# Typical to import pandas with alias "pd"
import pandas as pd

## Series and DataFrame

* A *Series* in Pandas is a one-dimensional array holding data of any type.
* A *DataFrame* in Pandas is a two-dimensional array, or a table with rows and columns.

Data sets in Pandas are usually multi-dimensional tables (DataFrames). A Series is like a single column; a DataFrame is the whole table.
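
The relationship between the two can be checked directly. The sketch below uses a small made-up table and shows that selecting one column of a DataFrame gives back a Series.

```python
import pandas as pd

# Hypothetical table, made up for this illustration
table = pd.DataFrame({
    "City": ["Oslo", "Bergen"],
    "Country": ["Norway", "Norway"],
})

# Selecting a single column returns a Series
city_column = table["City"]

print(type(table).__name__)
print(type(city_column).__name__)

# shape is (rows, columns) for a DataFrame and (rows,) for a Series
print(table.shape, city_column.shape)
```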

### Series

In [None]:
# from pandas import Series, DataFrame

my_series = pd.Series(["one", "two", "three", "four", "five"])
print(my_series)

# Heads up! If not specified, values are labeled with default index numbers starting at 0

In [None]:
# Values of a series
my_series.values

In [None]:
# Index of a series
my_series.index

In [None]:
# Get a single element using the index
print(my_series[1])

In [None]:
# Add explicit indexes

my_series = pd.Series(["one", "two", "three", "four", "five"], index=["a", "b", "c", "d", "e"])
print(my_series)

In [None]:
# Assignments (data manipulation)

my_series["b"] += "_more"
print(my_series)

In [None]:
# Get subset of data (filtering)

my_series = my_series[["a", "b", "d", "e"]]
print(my_series)

In [None]:
# Filter on conditions (filtering)

my_series = my_series[my_series != "four"]
print(my_series)

In [None]:
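# Manipulate all data
# Heads up! For strings, * 2 repeats each value ("one" becomes "oneone")

my_series = my_series * 2
print(my_series)

# The same operation wrapped in a lambda
my_lambda = lambda s: s * 2
my_series = my_lambda(my_series)
print(my_series)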

In [None]:
# Check for existence (key)

print("a" in my_series)
print("g" in my_series)

# More examples on search (value)
print("one" in my_series.values)
print("fifteen" in my_series.values)

# Pick non-existing element with default
print(my_series.get("x", default="non-existing"))

In [None]:
# Simplify it a bit, use a dict

cities = {'Oslo':634293, 'Bergen':271949, 'Kristiansand':85983}
city_series = pd.Series(cities)
print(city_series)

In [None]:
# Restrict to what you want

cities = {'Oslo':634293, 'Bergen':271949, 'Kristiansand':85983}
city_series = pd.Series(cities, index=["Oslo", "Bergen"])  # Kristiansand is omitted
# Adding a value
# Heads up! Assigning to an existing key overwrites its value without warning
city_series["New York"] = 8000000
print(city_series)

In [None]:
# Properly name it all
cities = {'Oslo':634293, 'Bergen':271949, 'Kristiansand':85983}
city_series = pd.Series(cities, name='Population')
# city_series.name = "Population"
city_series.index.name = "City"
print(city_series)

### DataFrame
* Supports named/labeled rows and columns
* Can perform operations on rows and columns
* Supports reading and writing JSON, CSV, Excel, etc.
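
As a rough sketch of what "operations on rows and columns" can look like, the example below uses a small made-up score table. The names and numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical scores, made up for this illustration
scores = pd.DataFrame(
    {"Math": [4, 5, 3], "Physics": [5, 4, 4]},
    index=["Kari", "Ola", "Per"],
)

# Pick one row by its label
ola_row = scores.loc["Ola"]

# Column-wise operation: one mean per column
column_means = scores.mean()

# Row-wise operation: one sum per row
row_sums = scores.sum(axis=1)

print(ola_row)
print(column_means)
print(row_sums)
```

Most aggregation methods work column by column by default; passing `axis=1` switches them to work row by row.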

In [None]:
# Simple example

names = ["Kari", "Ola", "Per"]
ages = [50, 40, 45]
professions = ["Doctor", "Bus Driver", "Engineer"]

person_dataframe = pd.DataFrame(list(zip(names, ages, professions)))
print(person_dataframe)


In [None]:
# Name columns (simple example)

person_dataframe = pd.DataFrame(list(zip(names, ages, professions)), columns=["Name", "Age", "Profession"])
print(person_dataframe)

In [None]:
# Start index at another value

person_dataframe.index += 1
print(person_dataframe)


In [None]:
# Custom index

person_dataframe.index = ["a", "b", "c"]
print(person_dataframe)

In [None]:
# Pick only a few elements

# From the top
print(person_dataframe.head(1))

# From the end
print(person_dataframe.tail(1))


In [None]:
# Check data types

print(person_dataframe.dtypes)
# The index is now "a", "b", "c", so use .iloc for position-based access
print(person_dataframe["Age"].iloc[0].dtype)

In [None]:
# Filtering

over_40_dataframe = person_dataframe[person_dataframe["Age"] > 40]
print(over_40_dataframe)

************

In [None]:
# A few other examples (perhaps better?)

persons_dict = {
    "Name": ["Kari", "Ola", "Per"],
    "Age": [50, 40, 45],
    "Profession": ["Doctor", "Bus Driver", "Engineer"]
}

person_dataframe = pd.DataFrame(persons_dict)
#person_dataframe

person_dataframe = pd.DataFrame(persons_dict, index=["A", "B", "C"])
person_dataframe

In [None]:
# Only certain columns

person_dataframe = pd.DataFrame(persons_dict, index=["A", "B", "C"], columns=["Name", "Profession"])
person_dataframe


In [None]:
# Adding column

seniority = [15, 8, 12.5]
person_dataframe["Seniority"] = seniority
person_dataframe


In [None]:
# Manipulating a column

# person_dataframe["Seniority"] += 1

# assign() returns a new DataFrame; the lambda receives the whole DataFrame
person_dataframe = person_dataframe.assign(Seniority=lambda df: df["Seniority"] + 1)

person_dataframe


In [None]:
# Two-dimensional list as data input

epl_league_result = [
    ["Manchester City", 45, 89],
    ["Arsenal", 28, 84],
    ["Manchester United", 34, 75]
]
epl_dataframe = pd.DataFrame(epl_league_result, columns=["Team", "Goal diff.", "Points"])
epl_dataframe.index += 1 
epl_dataframe

# The same data as a dict (one list per column):
# epl_league_result = {
#     "Team": ["Manchester City", "Arsenal", "Manchester United"],
#     "Goal diff.": [45, 28, 34],
#     "Points": [89, 84, 75]
# }

### Write to & read from file (I/O)

Pandas supports a multitude of functions and formats, but we will focus on two of the most common:

* to_json & read_json
* to_csv & read_csv

https://pandas.pydata.org/docs/reference/io.html
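
Before writing real files, the round trip can be tried in memory. The sketch below assumes a small made-up table and uses `io.StringIO` as a stand-in for a file. Called without a path, `to_csv` and `to_json` return the text instead of writing it.

```python
import io

import pandas as pd

# Hypothetical table, made up for this illustration
original = pd.DataFrame({"Team": ["A", "B"], "Points": [3, 1]})

# Without a path, to_csv returns the CSV text as a string
csv_text = original.to_csv(sep=";", index=False)

# read_csv accepts file-like objects, so wrap the text in StringIO
restored_from_csv = pd.read_csv(io.StringIO(csv_text), sep=";")

# The same idea works for JSON
json_text = original.to_json()
restored_from_json = pd.read_json(io.StringIO(json_text))

print(csv_text)
print(restored_from_csv)
print(restored_from_json)
```

Note that the separator used when writing (`sep=";"`) has to be repeated when reading, otherwise the whole line ends up in a single column.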


In [None]:
# JSON

# Write
file_name_json = "EPL Table 2022-23.json"
epl_dataframe.to_json(file_name_json, indent=4)

In [None]:
# Read
dataframe_from_json_file = pd.read_json(file_name_json)
dataframe_from_json_file

In [None]:
# CSV

# Write
file_name_csv = "EPL Table 2022-23.csv"
epl_dataframe.to_csv(file_name_csv, sep=";", index=False)

In [None]:
# Read
dataframe_from_csv = pd.read_csv(file_name_csv, sep=";")
dataframe_from_csv

************

In [None]:
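# One final example: load JSON from a web API into a DataFrame
import requests

json_url = "https://dummyjson.com/products/category/smartphones"
json_result = requests.get(json_url).json()

# The "products" key holds a list of dicts, one per product
product_dataframe = pd.DataFrame(json_result["products"])

# Writing .xlsx files requires an Excel engine such as openpyxl
product_dataframe.to_excel("Smartphones.xlsx")

product_dataframe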