## Day 56 & 57 of 100DaysOfCode 🐍
### Python Library - Pandas Basics, Series, Loading DataFrames (CSV, JSON), Analyzing DataFrames



#### **Pandas 🐼📊**

Pandas is a Python library for working with data sets. It provides tools for analyzing, cleaning, exploring, and manipulating data.

#### **Purpose of Pandas 🐼📊**

- Pandas allows us to analyze large data sets and draw conclusions based on statistical theories.

- Pandas can clean messy data sets, and make them readable and relevant.

#### **Installation of Pandas**

In [None]:
!pip install pandas

#### **Importing Pandas**

In [None]:
import pandas as pd

#### **Example**

In [None]:
# Creating a dictionary dataset
mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}

# Loading the dataset in Pandas
myvar = pd.DataFrame(mydataset)

print(myvar)

    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2


#### **Checking Pandas Version**

In [None]:
import pandas as pd

print(pd.__version__)

1.5.3


#### **Pandas Series**

- A Pandas Series is like a column in a table.

- It is a one-dimensional array holding data of any type.



In [None]:
# Creating a simple Pandas Series from a list

a = [1, 3, 5]

myvar = pd.Series(a)

print(myvar)

0    1
1    3
2    5
dtype: int64


##### **Labels**

- If nothing else is specified, the values are labeled with their index number. The first value has index 0, the second value has index 1, and so on.

In [None]:
# Returning the third value of the Series

a = [11, 13, 15]

myvar = pd.Series(a)

print(myvar[2])

15


##### **Creating Labels**

- We can name our own labels with the `index` argument.


In [None]:
a = [1, 3, 5]

myvar = pd.Series(a, index = ["x", "y", "z"])

print(myvar)

x    1
y    3
z    5
dtype: int64


In [None]:
# Returning the value of "y"

a = [1, 3, 5]

myvar = pd.Series(a, index = ["x", "y", "z"])

print(myvar["y"])

3
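
Once a Series has custom labels, it can help to be explicit about whether a lookup is by label or by position. The following is a minimal sketch using the `loc` and `iloc` accessors on the same Series as above; the names `by_label` and `by_position` are only illustrative.

```python
import pandas as pd

# Same Series as above: values 1, 3, 5 labeled "x", "y", "z"
myvar = pd.Series([1, 3, 5], index=["x", "y", "z"])

by_label = myvar.loc["y"]     # look up by label
by_position = myvar.iloc[1]   # look up by integer position (0-based)

print(by_label, by_position)
```

Both lookups return the same value here, because "y" is the label of the second item.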


##### **Key/Value Objects as Series**

- We can also use a key/value object, like a dictionary, when creating a Series.

In [None]:
# Creating a simple Pandas Series from a dictionary

calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories)

print(myvar)

day1    420
day2    380
day3    390
dtype: int64


In [None]:
# Creating a Series using only data from "day1" and "day2"

calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories, index = ["day1", "day2"])

print(myvar)

day1    420
day2    380
dtype: int64
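
As a small sketch of what happens when the `index` argument names a label that is not a key in the dictionary: Pandas keeps the label and fills its value with `NaN`. The label "day4" below is made up for illustration.

```python
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

# "day4" is a made-up label that is not a key in the dictionary
myvar = pd.Series(calories, index=["day1", "day4"])

print(myvar)

# Check which entries ended up missing
print(myvar.isna())
```

Because `NaN` is a floating-point value, the resulting Series is stored as floats rather than integers.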


##### **Exercise**

In [None]:
# Inserting the correct Pandas method to create a Series

pd.Series(mylist)

##### **Exercise**



In [None]:
# Inserting the correct syntax to return the first value of a Pandas Series called "myseries"

myseries[0]

##### **Exercise**

In [None]:
# Inserting the correct syntax to add the labels "x", "y", and "z" to a Pandas Series

pd.Series(mylist, index = ["x", "y", "z"])

#### **DataFrames**

- Data sets in Pandas are usually multi-dimensional tables, called DataFrames.

- A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.

- A *Series* is like a *column*; a **DataFrame** is the **whole table**.



In [None]:
# Creating a simple Pandas DataFrame

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

# Loading data into a DataFrame object
df = pd.DataFrame(data)

print(df)

   calories  duration
0       420        50
1       380        40
2       390        45
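
To make the column/table relationship concrete, here is a minimal sketch that selects one column from the same data and checks the types involved. The variable name `column` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
})

column = df["calories"]   # selecting one column by name

print(type(df).__name__)       # the whole table
print(type(column).__name__)   # a single column
print(df.shape)                # (number of rows, number of columns)
```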


##### **Locating Row**

- Pandas uses the `loc` attribute to return one or more specified rows.

In [None]:
# Returning row 0

print(df.loc[0])

calories    420
duration     50
Name: 0, dtype: int64


##### This example returns a Pandas Series.

In [None]:
# Returning rows 1 and 2 (using a list of indexes)

print(df.loc[[1, 2]])

   calories  duration
1       380        40
2       390        45
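
The difference between the two calls above is the brackets: a single label gives back a Series, while a list of labels gives back a DataFrame, even if the list holds only one label. A short sketch, with illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

one_row = df.loc[0]        # single label   -> Series
one_row_df = df.loc[[0]]   # list of labels -> DataFrame

print(type(one_row).__name__)
print(type(one_row_df).__name__)
```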


##### **Locating Named Indexes**

- Use the named index in the `loc` attribute to return the specified row(s).



In [None]:
# Adding a list of names to give each row a name

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

df = pd.DataFrame(data, index = ["day1", "day2", "day3"])

print(df)

      calories  duration
day1       420        50
day2       380        40
day3       390        45


In [None]:
# Returning "day2"

print(df.loc["day2"])

calories    380
duration     40
Name: day2, dtype: int64


#### **Loading Files Into a DataFrame**

- If your data sets are stored in a file, Pandas can load them into a DataFrame.

In [None]:
# Loading a comma separated file (CSV file) into a DataFrame

import pandas as pd

df = pd.read_csv('data.csv')

print(df)

# print(df.to_string()) - Returning the entire DataFrame
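
If no `data.csv` file is at hand, the same call can be tried on in-memory text. This is a minimal sketch assuming a tiny made-up CSV; `io.StringIO` wraps the string so that `read_csv` can treat it like a file.

```python
import io
import pandas as pd

# Made-up CSV text standing in for the contents of a file such as data.csv
csv_text = """Duration,Pulse,Calories
60,110,409.1
45,109,282.4
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df)
```

The first line of the text becomes the column names, and each following line becomes one row.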

##### **Exercise**

In [None]:
# Inserting the correct Pandas method to create a DataFrame

pd.DataFrame(data)

##### **Exercise**

In [None]:
# Inserting the correct syntax to return the first row of a DataFrame

df.loc[0]

#### **Pandas Read CSV**

- A simple way to store big data sets is to use CSV files (comma separated files).

> Tip: use `to_string()` to print the entire DataFrame.

In [None]:
# Loading the CSV into a DataFrame

import pandas as pd

df = pd.read_csv('https://www.w3schools.com/python/pandas/data.csv.txt')

print(df.to_string())

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
5          60    102       127     300.0
6          60    110       136     374.0
7          45    104       134     253.3
8          30    109       133     195.1
9          60     98       124     269.0
10         60    103       147     329.3
11         60    100       120     250.7
12         60    106       128     345.3
13         60    104       132     379.3
14         60     98       123     275.0
15         60     98       120     215.2
16         60    100       120     300.0
17         45     90       112       NaN
18         60    103       123     323.0
19         45     97       125     243.0
20         60    108       131     364.2
21         45    100       119     282.0
22         60    130       101     300.0
23         45   

##### **Exercise**

In [None]:
# Inserting the correct syntax to return the entire DataFrame

df.to_string()

##### **Exercise**

In [None]:
# Inserting the correct syntax for loading CSV files into a DataFrame

pd.read_csv(data)

#### **Pandas Read JSON**

- JSON is a plain-text format with an object-like structure. It is widely used in programming and is supported by Pandas.

- JSON objects have a format similar to Python dictionaries.
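
A small sketch of how a Python dictionary relates to JSON text: the dictionary is serialized with the standard `json` module and the resulting text is read back with `read_json`. The records below are made up, and the column-to-rows layout is an assumption chosen for the example.

```python
import io
import json
import pandas as pd

# Made-up records in a column -> {row label -> value} layout
data = {
    "Duration": {"0": 60, "1": 45},
    "Pulse": {"0": 110, "1": 109},
}

json_text = json.dumps(data)   # a Python dict serialized to JSON text

# StringIO lets read_json treat the text like a file
df = pd.read_json(io.StringIO(json_text))

print(df)
```

The dictionary and the JSON text look alike, but the JSON version is a plain string that can be saved to a file or sent over a network.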

In [None]:
# Loading the JSON file into a DataFrame

import pandas as pd

df1 = pd.read_json('https://www.w3schools.com/python/pandas/data.js')

print(df1.to_string())

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
5          60    102       127     300.5
6          60    110       136     374.0
7          45    104       134     253.3
8          30    109       133     195.1
9          60     98       124     269.0
10         60    103       147     329.3
11         60    100       120     250.7
12         60    106       128     345.3
13         60    104       132     379.3
14         60     98       123     275.0
15         60     98       120     215.2
16         60    100       120     300.0
17         45     90       112       NaN
18         60    103       123     323.0
19         45     97       125     243.0
20         60    108       131     364.2
21         45    100       119     282.0
22         60    130       101     300.0
23         45   

##### **Exercise**

In [None]:
# Inserting the correct syntax for loading JSON files into a DataFrame

pd.read_json(data)

#### **Analyzing DataFrames**

##### Viewing the Data

- One of the most used methods for getting a quick overview of a DataFrame is the `head()` method.

- The `head()` method returns the headers and a specified number of rows, starting from the top.

In [None]:
# Getting a quick overview by printing the first 10 rows of the DataFrame

import pandas as pd

df = pd.read_csv('https://www.w3schools.com/python/pandas/data.csv.txt')

print(df.head(10))

   Duration  Pulse  Maxpulse  Calories
0        60    110       130     409.1
1        60    117       145     479.0
2        60    103       135     340.0
3        45    109       175     282.4
4        45    117       148     406.0
5        60    102       127     300.0
6        60    110       136     374.0
7        45    104       134     253.3
8        30    109       133     195.1
9        60     98       124     269.0



##### If the number of rows is not specified, the `head()` method will return the top 5 rows.

In [None]:
# Getting overview of the first 5 rows of the DataFrame

import pandas as pd

df = pd.read_csv('https://www.w3schools.com/python/pandas/data.csv.txt')

print(df.head())

   Duration  Pulse  Maxpulse  Calories
0        60    110       130     409.1
1        60    117       145     479.0
2        60    103       135     340.0
3        45    109       175     282.4
4        45    117       148     406.0


- There is also a `tail()` method for viewing the last rows of the DataFrame.

- The `tail()` method returns the headers and a specified number of rows, starting from the bottom.

In [None]:
# Printing the last 5 rows of the DataFrame

import pandas as pd

df = pd.read_csv('https://www.w3schools.com/python/pandas/data.csv.txt')

print(df.tail())

     Duration  Pulse  Maxpulse  Calories
164        60    105       140     290.8
165        60    110       145     300.0
166        60    115       145     310.2
167        75    120       150     320.4
168        75    125       150     330.4


##### **Info About the Data**

- The DataFrame object has a method called `info()` that gives you more information about the data set.


In [None]:
# Printing information about the data

print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Duration  169 non-null    int64  
 1   Pulse     169 non-null    int64  
 2   Maxpulse  169 non-null    int64  
 3   Calories  164 non-null    float64
dtypes: float64(1), int64(3)
memory usage: 5.4 KB
None
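
In the output above, `Calories` has 164 non-null values out of 169 entries, so 5 values are missing. The same counts can be computed directly; the sketch below uses a small made-up frame, and the names `missing` and `present` are only illustrative.

```python
import pandas as pd

# Small made-up frame with one missing value in "Calories"
df = pd.DataFrame({
    "Duration": [60, 45, 30],
    "Calories": [409.1, None, 195.1],
})

missing = df.isna().sum()   # number of missing values per column
present = df.count()        # number of non-null values per column

print(missing)
print(present)
```

For each column, the two counts add up to the number of rows.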


##### **Exercise**

In [None]:
# Inserting the correct syntax for returning the headers and the first 10 rows of a DataFrame

df.head(10)

##### **Exercise**

In [None]:
# The head() method returns the first rows; which method returns the last rows?

df.tail()