# Pandas

Essential functions for data analysis with the Pandas library in Python.

https://pandas.pydata.org/

This section walks through exploratory data analysis, from loading the data to searching through it.

## Load data

Pandas provides several methods for loading data from different formats into a DataFrame.
We will use the Iris dataset and the National Park Service biodiversity dataset. Their data dictionaries can be found at these URLs:
+ https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.names
+ https://www.kaggle.com/nationalparkservice/park-biodiversity/data


In [69]:
import pandas as pd

# header=None alone (commented-out line below) loads the data without column names;
# passing names= in the same call defines them
#df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None)
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', 
                 header=None, names=["sepal_length","sepal_width", "petal_length","petal_width","class"])

# the parks dataset lets us explore features that Iris does not offer, such as categorical variables;
# here index_col selects the column that identifies each row
df_park = pd.read_csv('../datasets/parks.csv', index_col=['Park Code'], encoding='utf-8')
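
The two loading patterns above can be tried without network access or local files. The following is a minimal sketch that uses small in-memory CSV strings as stand-ins for iris.data and parks.csv. The sample rows and the variable names (`raw_iris`, `raw_parks`, `iris_sample`, `parks_sample`) are assumptions made for illustration only.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a headerless file such as iris.data
raw_iris = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
)
# header=None tells pandas the first line is data; names supplies the column labels
iris_sample = pd.read_csv(
    raw_iris,
    header=None,
    names=["sepal_length", "sepal_width", "petal_length", "petal_width", "class"],
)

# Hypothetical CSV with a header row, standing in for parks.csv
raw_parks = io.StringIO(
    "Park Code,Park Name,State\n"
    "ACAD,Acadia National Park,ME\n"
    "ARCH,Arches National Park,UT\n"
)
# index_col moves "Park Code" out of the columns and uses it as the row index
parks_sample = pd.read_csv(raw_parks, index_col="Park Code")

print(iris_sample.columns.tolist())
print(parks_sample.index.tolist())
```

Because "Park Code" becomes the index, it no longer appears in `parks_sample.columns`.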


Another option is to download the dataset and save it in your repository, which also lets you refresh it later:

In [65]:
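import csv
import requests

# the source URL is left empty in the original; set it to the CSV you want to download
data = requests.get('')
data.raise_for_status()  # stop here if the download failed
# newline="" keeps the csv module from writing blank lines between rows on Windows
with open("../datasets/name.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    reader = csv.reader(data.text.splitlines())
    for row in reader:
        writer.writerow(row)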
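
If the data is already in a DataFrame, pandas can write the file itself with `to_csv`, which avoids the manual `csv` loop. This is a sketch under assumptions: the small `sample` frame and the temporary output path are hypothetical and only stand in for the notebook's DataFrames and the ../datasets folder.

```python
import os
import tempfile

import pandas as pd

# Small stand-in DataFrame; in the notebook this would be df or df_park
sample = pd.DataFrame(
    {
        "Park Name": ["Acadia National Park", "Arches National Park"],
        "State": ["ME", "UT"],
    },
    index=pd.Index(["ACAD", "ARCH"], name="Park Code"),
)

# Write to a temporary directory (hypothetical path) so the sketch leaves ../datasets untouched
out_path = os.path.join(tempfile.mkdtemp(), "parks_sample.csv")
sample.to_csv(out_path, encoding="utf-8")  # the index is written as the first column

# Read the file back, restoring "Park Code" as the index
restored = pd.read_csv(out_path, index_col="Park Code", encoding="utf-8")
print(restored)
```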
                    

## Exploration

Print the first three rows

In [40]:
df.head(3)

   sepal_length  sepal_width  petal_length  petal_width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa


In [70]:
df_park.head(3)

                        Park Name State   Acres  Latitude  Longitude
Park Code                                                           
ACAD         Acadia National Park    ME   47390     44.35     -68.21
ARCH         Arches National Park    UT   76519     38.68    -109.57
BADL       Badlands National Park    SD  242756     43.75    -102.50
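
As a small complement to `head(3)`, the sketch below shows how the row count argument behaves and how `tail` mirrors it at the end of the frame. The ten-row `numbers` frame is a hypothetical example, not part of the datasets used here.

```python
import pandas as pd

# Hypothetical 10-row frame with values 0..9
numbers = pd.DataFrame({"value": range(10)})

first_three = numbers.head(3)   # rows at positions 0, 1, 2
last_two = numbers.tail(2)      # rows at positions 8, 9
default_head = numbers.head()   # without an argument, head() returns 5 rows

print(first_three)
print(last_two)
print(len(default_head))
```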


Get and print a specific row

In [41]:
df.iloc[0]

sepal_length            5.1
sepal_width             3.5
petal_length            1.4
petal_width             0.2
class           Iris-setosa
Name: 0, dtype: object

Get values through your DataFrame's index:
+ loc[] takes an index label or a list of labels (strings here, because the index is the park code)
+ iloc[] takes an integer position or a list of integer positions

In [80]:
print(df_park.loc["ACAD"])
print("")
print(df_park.loc[["ACAD","ARCH"]])
print("")
print(df_park.iloc[[1,2]])

Park Name    Acadia National Park
State                          ME
Acres                       47390
Latitude                    44.35
Longitude                  -68.21
Name: ACAD, dtype: object

                      Park Name State  Acres  Latitude  Longitude
Park Code                                                        
ACAD       Acadia National Park    ME  47390     44.35     -68.21
ARCH       Arches National Park    UT  76519     38.68    -109.57

                        Park Name State   Acres  Latitude  Longitude
Park Code                                                           
ARCH         Arches National Park    UT   76519     38.68    -109.57
BADL       Badlands National Park    SD  242756     43.75    -102.50
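
The difference between label-based and position-based access is easiest to see when the index labels are integers that do not match the row positions. The following sketch uses a hypothetical `scores` frame for that purpose. It also shows that a `loc` slice includes its end label while an `iloc` slice excludes its end position.

```python
import pandas as pd

# Hypothetical frame whose integer labels (10, 20, 30) differ from the positions (0, 1, 2)
scores = pd.DataFrame({"score": [90, 80, 70]}, index=[10, 20, 30])

by_label = scores.loc[10, "score"]   # label 10 and column "score": first row
by_position = scores.iloc[0, 0]      # row position 0, column position 0: same cell

# Slices behave differently: loc includes the end label, iloc excludes the end position
label_slice = scores.loc[10:20]      # rows labelled 10 and 20
position_slice = scores.iloc[0:1]    # only the row at position 0

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```

In this frame `scores.loc[0]` would raise a `KeyError`, because 0 is a position and not one of the labels.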


The next lines show the number of rows and columns in the dataset (shape), the number of rows alone (len), and how to get the column names.


In [89]:
print(df.shape)
print("")
print(len(df))
print("")
print(df_park.columns)
print("")
print(df.columns)

(150, 5)

150

Index(['Park Name', 'State', 'Acres', 'Latitude', 'Longitude'], dtype='object')

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class'], dtype='object')
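
Since `shape` is a plain `(rows, columns)` tuple, it can be unpacked directly, and `columns` can be converted to a regular Python list when needed. A minimal sketch on a hypothetical three-row `measurements` frame:

```python
import pandas as pd

# Hypothetical frame with 3 rows and 2 columns
measurements = pd.DataFrame(
    {"sepal_length": [5.1, 4.9, 4.7], "class": ["Iris-setosa"] * 3}
)

n_rows, n_cols = measurements.shape            # shape is a (rows, columns) tuple
column_names = measurements.columns.tolist()   # plain Python list of column labels

print(n_rows, n_cols)
print(len(measurements) == n_rows)             # len() counts rows only
print(column_names)
```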


Select a column by name and limit the number of rows displayed

In [94]:
print(df_park['State'][:2])
print("")
print(df['sepal_length'][:5])
print("")
# columns are also mapped to attributes of the DataFrame, so they can be accessed with dot notation
print(df_park.Acres.head(2))

Park Code
ACAD    ME
ARCH    UT
Name: State, dtype: object

0    5.1
1    4.9
2    4.7
3    4.6
4    5.0
Name: sepal_length, dtype: float64

Park Code
ACAD    47390
ARCH    76519
Name: Acres, dtype: int64
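
The selection styles above can be compared side by side. The sketch below rebuilds two rows of the parks data shown earlier in a hypothetical `parks` frame. It illustrates that bracket and dot notation return the same Series, that a list of names returns a DataFrame, and that a name containing a space can only be reached with brackets.

```python
import pandas as pd

# Two rows copied from the parks output above; the frame name is an assumption
parks = pd.DataFrame(
    {
        "Park Name": ["Acadia National Park", "Arches National Park"],
        "State": ["ME", "UT"],
        "Acres": [47390, 76519],
    },
    index=pd.Index(["ACAD", "ARCH"], name="Park Code"),
)

acres_bracket = parks["Acres"]       # single name in brackets: a Series
acres_attr = parks.Acres             # dot notation: the same Series
subset = parks[["State", "Acres"]]   # list of names: a DataFrame with those columns
park_names = parks["Park Name"]      # a name with a space is not a valid attribute; use brackets

print(acres_bracket.equals(acres_attr))
print(subset.shape)
print(park_names.tolist())
```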


In our dataset of parks the column "Park Name" a space sepa