# 1 Info

This is a compiled of pandas and python code examples that will be usefull when coding. the focus will be pandas but some python code examples may appear.

# 2 Examples

    2.1 pd.loc[]

    2.2 paths

    2.3 Improve pandas efficiency

In [2]:
import pandas as pd

## 2.1 pd.loc[]

https://sparkbyexamples.com/pandas/pandas-dataframe-loc/

loc is used to select rows and columns by names/labels of pandas DataFrame.


df.loc[START:STOP:STEP,START:STOP:STEP]
        
df.loc[rows,columns]

    * START : the name of the row/column label
    * STOP : is the name of the last row/column label to take
    * STEP : is the number of indices to advance after each iteration

In [3]:
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","pandas"],
    'Fee' :[20000,25000,26000,22000,24000],
    'Duration':['30day','40days','35days','40days','60days'],
    'Discount':[1000,2300,1200,2500,2000]
}

df = pd.DataFrame(technologies)

In [5]:
# select single row
df.loc[2]

Courses     Hadoop
Fee          26000
Duration    35days
Discount      1200
Name: 2, dtype: object

In [6]:
# select single column
df.loc[:,"Courses"]

0      Spark
1    PySpark
2     Hadoop
3     Python
4     pandas
Name: Courses, dtype: object

In [7]:
# select multiple rows
df.loc[[3,4]]

Unnamed: 0,Courses,Fee,Duration,Discount
3,Python,22000,40days,2500
4,pandas,24000,60days,2000


In [8]:
# select multiple columns
df.loc[:,["Courses","Fee"]]

Unnamed: 0,Courses,Fee
0,Spark,20000
1,PySpark,25000
2,Hadoop,26000
3,Python,22000
4,pandas,24000


In [9]:
# select rows range
df.loc[1:3]

Unnamed: 0,Courses,Fee,Duration,Discount
1,PySpark,25000,40days,2300
2,Hadoop,26000,35days,1200
3,Python,22000,40days,2500


In [10]:
# select columns range
df.loc[:,"Fee":"Discount"]

Unnamed: 0,Fee,Duration,Discount
0,20000,30day,1000
1,25000,40days,2300
2,26000,35days,1200
3,22000,40days,2500
4,24000,60days,2000


In [12]:
# Select alternate rows
df.loc[0:4:2]

Unnamed: 0,Courses,Fee,Duration,Discount
0,Spark,20000,30day,1000
2,Hadoop,26000,35days,1200
4,pandas,24000,60days,2000


In [15]:
# select alternate columns
df.loc[:,"Courses":"Discount":2]

Unnamed: 0,Courses,Duration
0,Spark,30day
1,PySpark,40days
2,Hadoop,35days
3,Python,40days
4,pandas,60days


In [16]:
# Using condition
df.loc[df["Fee"]>=24000]

Unnamed: 0,Courses,Fee,Duration,Discount
1,PySpark,25000,40days,2300
2,Hadoop,26000,35days,1200
4,pandas,24000,60days,2000


In [24]:
# Using lambda 
df.loc[lambda x:x["Discount"] > 2100]

Unnamed: 0,Courses,Fee,Duration,Discount
1,PySpark,25000,40days,2300
3,Python,22000,40days,2500


## 2.2 Working with paths

In [25]:
# we will use the library pathlib
from pathlib import Path

Path.cwd() will return the path for the current folder

In [26]:
Path.cwd()

PosixPath('/Users/sergiososabautista/Library/CloudStorage/GoogleDrive-sergio.sosa.py8@gmail.com/My Drive/python')

There are paths that can be created from strings

In [29]:
path_string = Path("Sergio_User")
path_string

PosixPath('Sergio_User')

## 2.3 Improve pandas efficiency

https://towardsdatascience.com/how-to-enhance-your-pandas-code-dont-wait-no-more-5fb89bc1ece9

Two bassic tips:
        
        Use only what you need

        Choose the right variable type

### 2.3.1 Avoid loops at all costs

### 2.3.2 If you need loops, do not use iterrows()

Try different options like "apply()", "list comprehension", and in specific itertuples().

with itertuples() the field can be access by their names with the "."

* Below we can see an example of how much the speed is improve.

In [33]:
%%time

df = pd.read_csv('./data/iris.csv')

def subtract_open_from_close(df):
    subtracted_price = []
    for i,row in df.iterrows():
        subtracted_price.append(row['sepal_length'] - row['sepal_width'])
    
    df['SubtractedPrice'] = subtracted_price
    return df
        
data = subtract_open_from_close(df)

CPU times: user 7.71 ms, sys: 1.03 ms, total: 8.74 ms
Wall time: 7.85 ms


In [35]:
%%time

df = pd.read_csv('./data/iris.csv')

def subtract_open_from_close(df):
    subtracted_price = []
    for tup in df.itertuples():
        subtracted_price.append(tup.sepal_length - tup.sepal_width)
    
    df['SubtractedPrice'] = subtracted_price
    return df
        
data = subtract_open_from_close(df)

CPU times: user 3.68 ms, sys: 1.64 ms, total: 5.31 ms
Wall time: 4.3 ms


### 2.3.3 Use query on large datasets

The improvment will be notice even more in large datasets

In [37]:
%%time
df[df['sepal_length']>1]

CPU times: user 1.32 ms, sys: 2.11 ms, total: 3.43 ms
Wall time: 6.29 ms


Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class,SubtractedPrice
0,5.1,3.5,1.4,0.2,Iris-setosa,1.6
1,4.9,3.0,1.4,0.2,Iris-setosa,1.9
2,4.7,3.2,1.3,0.2,Iris-setosa,1.5
3,4.6,3.1,1.5,0.2,Iris-setosa,1.5
4,5.0,3.6,1.4,0.2,Iris-setosa,1.4
...,...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica,3.7
146,6.3,2.5,5.0,1.9,Iris-virginica,3.8
147,6.5,3.0,5.2,2.0,Iris-virginica,3.5
148,6.2,3.4,5.4,2.3,Iris-virginica,2.8


In [38]:
%%time
df.query('sepal_length > 1')

CPU times: user 2.49 ms, sys: 1.52 ms, total: 4.01 ms
Wall time: 6.2 ms


Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class,SubtractedPrice
0,5.1,3.5,1.4,0.2,Iris-setosa,1.6
1,4.9,3.0,1.4,0.2,Iris-setosa,1.9
2,4.7,3.2,1.3,0.2,Iris-setosa,1.5
3,4.6,3.1,1.5,0.2,Iris-setosa,1.5
4,5.0,3.6,1.4,0.2,Iris-setosa,1.4
...,...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica,3.7
146,6.3,2.5,5.0,1.9,Iris-virginica,3.8
147,6.5,3.0,5.2,2.0,Iris-virginica,3.5
148,6.2,3.4,5.4,2.3,Iris-virginica,2.8


### 2.3.4 Don't default to csv

parquet.gzip is much faster than csv

In [39]:
%%time
df.to_csv('./data/df.csv', index=False)

CPU times: user 4.34 ms, sys: 4.62 ms, total: 8.97 ms
Wall time: 19 ms


In [40]:
%%time
df.to_csv('./data/df.parquet.gzip', index=False)

CPU times: user 1.65 ms, sys: 1.56 ms, total: 3.2 ms
Wall time: 2.48 ms
