<img src="https://pandas.pydata.org/static/img/pandas.svg" width="250">

## <center> Advanced Pandas </center>

## Intro to DataFrames

In [1]:
import pandas as pd

In [2]:
# Defining a regular Python dictionary

scores = {"name" : ['Ray', 'Japhy', 'Zosa'],
          "city" : ['San Francisco', 'San Francisco', 'Denver'],
          "score" : [75, 92, 94]
          }

In [3]:
scores

{'name': ['Ray', 'Japhy', 'Zosa'],
 'city': ['San Francisco', 'San Francisco', 'Denver'],
 'score': [75, 92, 94]}

In [4]:
# Converting the dictionary to a pandas DataFrame

df = pd.DataFrame(scores)

In [5]:
df

,name,city,score
0,Ray,San Francisco,75
1,Japhy,San Francisco,92
2,Zosa,Denver,94


Notice that pandas created a default integer index (0, 1, 2) when the dictionary was converted to a DataFrame.

A **pandas DataFrame** is a **two-dimensional labeled data structure** in the pandas library for Python. It is a table-like structure in which each column can have its own data type (such as numerical, categorical, or date/time) and each row is labeled with an *index*.

DataFrames can be thought of as a **spreadsheet** or **SQL table**, with rows and columns of data, but with the added functionality of powerful data manipulation and analysis tools. They can be created from a variety of data sources, including CSV files, SQL databases, and Python dictionaries.

Some of the key features of Pandas DataFrames include **indexing and selection**, **filtering**, **aggregation and grouping**, **merging and joining**, and **handling missing data**. They are commonly used for data exploration, cleaning, transformation, and analysis in data science and machine learning applications.
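As a rough illustration of three of the features listed above, here is a minimal sketch on a variant of the toy scores data. The missing score, the `states` lookup table, and the names `scores_demo`, `mean_by_city`, and `merged` are assumptions made up for this example.

```python
import numpy as np
import pandas as pd

# Toy data similar to the scores dictionary, with one score
# deliberately missing (an assumption for illustration)
scores_demo = pd.DataFrame({
    "name": ["Ray", "Japhy", "Zosa"],
    "city": ["San Francisco", "San Francisco", "Denver"],
    "score": [75, np.nan, 94],
})

# Handling missing data: fill the gap with the mean of the known scores
scores_demo["score"] = scores_demo["score"].fillna(scores_demo["score"].mean())

# Aggregation and grouping: average score per city
mean_by_city = scores_demo.groupby("city")["score"].mean()

# Merging and joining: attach a state to each row via a hypothetical lookup table
states = pd.DataFrame({"city": ["San Francisco", "Denver"], "state": ["CA", "CO"]})
merged = scores_demo.merge(states, on="city", how="left")

print(mean_by_city)
print(merged)
```

Filling with the mean is only one possible strategy; dropping the row with `dropna()` would be another.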

In [8]:
# Selecting a single column. df.score would return the same Series

df['score']

0    75
1    92
2    94
Name: score, dtype: int64
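Attribute access is convenient, but bracket access is the more general form. The sketch below uses a made-up frame (`demo`, with hypothetical `count` and `final score` columns) to show two cases where only brackets reach the column.

```python
import pandas as pd

# Hypothetical frame: one plain name, one name that clashes with a
# DataFrame method, and one name containing a space
demo = pd.DataFrame({
    "score": [75, 92, 94],
    "count": [1, 2, 3],
    "final score": [70, 90, 95],
})

# For a simple name, attribute and bracket access return equal Series
same = demo.score.equals(demo["score"])

# "count" is also a DataFrame method, so demo.count is the method, not the column
is_method = callable(demo.count)

# A name with a space cannot be written as an attribute; brackets are required
final = demo["final score"]

print(same, is_method)
print(final)
```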

In [17]:
# Creating a new column by concatenating the values of two existing columns

df['name_city'] = df['name'] + df['city']
df['name_city']

0      RaySan Francisco
1    JaphySan Francisco
2            ZosaDenver
Name: name_city, dtype: object
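The concatenation above joins the two strings with nothing between them. If a separator is wanted, one option is to add a literal string in the middle, or to use `Series.str.cat`. This is a small sketch on a fresh copy of the data; the frame name `people` and the column `name_city_cat` are chosen here for illustration.

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["Ray", "Japhy", "Zosa"],
    "city": ["San Francisco", "San Francisco", "Denver"],
})

# A literal string between the columns is broadcast to every row
people["name_city"] = people["name"] + ", " + people["city"]

# Series.str.cat does the same with an explicit separator argument
people["name_city_cat"] = people["name"].str.cat(people["city"], sep=", ")

print(people[["name_city", "name_city_cat"]])
```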

In [19]:
# Filtering the DataFrame to keep only rows with a score greater than 90

df[df['score']>90]

,name,city,score,name_city
1,Japhy,San Francisco,92,JaphySan Francisco
2,Zosa,Denver,94,ZosaDenver
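Filters can also combine several conditions. The sketch below, on a fresh copy of the scores data (named `df_demo` here), shows a boolean mask built with `&` and the same filter written with `DataFrame.query`.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "name": ["Ray", "Japhy", "Zosa"],
    "city": ["San Francisco", "San Francisco", "Denver"],
    "score": [75, 92, 94],
})

# Each condition needs its own parentheses; combine with & (and), | (or), ~ (not)
high_sf = df_demo[(df_demo["score"] > 90) & (df_demo["city"] == "San Francisco")]

# DataFrame.query expresses the same filter as a string
high_sf_query = df_demo.query("score > 90 and city == 'San Francisco'")

print(high_sf)
```

The parentheses matter because `&` binds more tightly than comparison operators in Python.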


## Top Pandas Functions

### Exploring your Data

In [21]:
# Importing CSV as Pandas DataFrame

df_iris = pd.read_csv('datasets/iris.csv')

In [28]:
# Checking the first 3 rows with .head(); it returns 5 rows when no argument is given

df_iris.head(3)

,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa


In [24]:
# Checking the dimensions of data

df_iris.shape

(150, 5)

In [29]:
# Checking the bottom rows

df_iris.tail(3)

,sepal_length,sepal_width,petal_length,petal_width,species
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica
149,5.9,3.0,5.1,1.8,virginica


In [27]:
# Checking the datatypes assigned by Pandas to each of the columns

df_iris.dtypes

sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species          object
dtype: object
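A few other calls are often useful at this stage. The sketch below is built from the six rows shown in the `head` and `tail` outputs above, so it runs without the CSV file. The frame name `sample` is an assumption for this example, and the statistics describe only these six rows, not the full dataset.

```python
import pandas as pd

# Six rows copied from the head/tail outputs, so no file is needed
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 6.5, 6.2, 5.9],
    "sepal_width": [3.5, 3.0, 3.2, 3.0, 3.4, 3.0],
    "petal_length": [1.4, 1.4, 1.3, 5.2, 5.4, 5.1],
    "petal_width": [0.2, 0.2, 0.2, 2.0, 2.3, 1.8],
    "species": ["setosa"] * 3 + ["virginica"] * 3,
})

# Summary statistics (count, mean, std, min, quartiles, max) for numeric columns
summary = sample.describe()

# Frequency of each value in a non-numeric column
species_counts = sample["species"].value_counts()

# Number of missing values per column
missing = sample.isna().sum()

print(summary)
print(species_counts)
print(missing)
```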

### Subsetting your data with the loc and iloc indexers

In [30]:
# Label-based slicing with .loc includes both endpoints (rows 3 through 5)

df_iris.loc[3:5]

,sepal_length,sepal_width,petal_length,petal_width,species
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa


In [31]:
# Selecting a single value by row label and column name

df_iris.loc[3,'sepal_length']

4.6

In [32]:
# Selecting the same value by integer position with .iloc (row 3, column 0)

df_iris.iloc[3,0]

4.6
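Both calls return 4.6 here only because the default index labels happen to equal the row positions. The sketch below uses the first six `sepal_length` values shown above in a small frame (named `demo`, with a made-up shifted index) to show where `loc` and `iloc` differ.

```python
import pandas as pd

demo = pd.DataFrame({"sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0, 5.4]})

# .loc slices by label and includes the end label: labels 3, 4, 5
by_label = demo.loc[3:5]

# .iloc slices by position and excludes the end position: positions 3, 4
by_position = demo.iloc[3:5]

# With a non-default index (hypothetical labels 10..15), labels and positions diverge
shifted = demo.set_index(pd.Index([10, 11, 12, 13, 14, 15]))
label_value = shifted.loc[13, "sepal_length"]   # row labeled 13
position_value = shifted.iloc[3, 0]             # fourth row, first column

print(len(by_label), len(by_position))
print(label_value, position_value)
```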

In [33]:
# Exporting data as CSV; index=False leaves the row index out of the file

df_iris.to_csv('iris-output.csv', index=False)
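To see what `index=False` changes, the sketch below writes a tiny made-up frame (`demo`) to in-memory strings instead of files and reads both versions back.

```python
import io

import pandas as pd

demo = pd.DataFrame({"species": ["setosa", "virginica"], "petal_width": [0.2, 1.8]})

# With no path, to_csv returns the CSV text; by default the index is
# written as a first column with an empty header
with_index = demo.to_csv()

# index=False writes only the data columns
without_index = demo.to_csv(index=False)

# Reading the default output back produces an extra "Unnamed: 0" column
reloaded_with = pd.read_csv(io.StringIO(with_index))
reloaded_without = pd.read_csv(io.StringIO(without_index))

print(list(reloaded_with.columns))
print(list(reloaded_without.columns))
```

If the index does carry meaning, an alternative is to keep it in the file and pass `index_col=0` to `pd.read_csv` when loading.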

### Options

In [34]:
emissions = pd.DataFrame({"country":['China','United States','India'],
          "year":['2018','2018','2018'],
          "co2_emissions":[10060000000.0,5410000000.0,2650000000.0]})

In [35]:
# 'max_rows' on its own matches more than one option key, so this call raises OptionError

pd.set_option('max_rows',2)
emissions

OptionError: 'Pattern matched multiple keys'
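The error arises because pandas treats the option name as a pattern, and `max_rows` matches more than one key. A likely fix, sketched below, is to use the fully qualified name `display.max_rows`. The variable names are chosen for this example.

```python
import pandas as pd

# The fully qualified key is unambiguous
pd.set_option("display.max_rows", 2)
current = pd.get_option("display.max_rows")

# Restore the default so later cells are not affected
pd.reset_option("display.max_rows")

# option_context applies a setting only inside the with-block
with pd.option_context("display.max_rows", 2):
    inside = pd.get_option("display.max_rows")
after = pd.get_option("display.max_rows")

print(current, inside, after)
```

`pd.option_context` can be the safer choice in a notebook, since the setting is undone automatically when the block ends.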