# Pandas tricks

Below are the tricks I found most interesting in the following article:
https://towardsdatascience.com/pandas-and-python-tips-and-tricks-for-data-science-and-data-analysis-1b1e05b7d93a

* **Apply and Lambda**

* .cut() Convert numerical data into categorical bins

* **Query() Select rows from a Pandas Dataframe based on column(s) values**

* Deal with zip files

* Select a subset of your pandas dataframe with specific column types

* **comment, Remove or split a column based on a character**

* **to_string() Print Pandas dataframe in Tabular format from console**

* **df.style  Highlight data points in Pandas**

* Reduce decimal points in your data

* Replace some values in your data frame

* **.compare(), Compare two data frames and get their differences**

* Get a subset of a very large dataset for quick analysis

* .melt(), unpivot, Transform your data frame from a wide to a long format

* Reduce the size of your Pandas data frame by ignoring the index when saving

* **Parquet instead of CSV**

* Transform your data frame into a markdown

* Format Date Time column

In [None]:
import pandas as pd

# Create the dataframe
candidates= {
    'Name':["Aida","Mamadou","Ismael","Aicha","Fatou", "Khalil"],
    'Degree':['Master','Master','Bachelor', "PhD", "Master", "PhD"],
    'From':["Abidjan","Dakar","Bamako", "Abidjan","Konakry", "Lomé"],
    'Years_exp': [2, 3, 0, 5, 4, 3],
    'From_office(min)': [120, 95, 75, 80, 100, 34]
          }
candidates_df = pd.DataFrame(candidates)

In [None]:
candidates_df.head(2)

Unnamed: 0,Name,Degree,From,Years_exp,From_office(min)
0,Aida,Master,Abidjan,2,120
1,Mamadou,Master,Dakar,3,95


## Apply and Lambda

df[new_col] = df.apply(lambda row: func(row), axis=1)


➡ **func** is the function you want to apply to your data frame.

➡ **axis=1** applies the function to each row of your data frame.

In [None]:
def candidate_info(row):
    """custom function"""
    # Select columns of interest 
    name = row.Name 
    is_from = row.From
    year_exp = row.Years_exp
    degree = row.Degree
    from_office = row["From_office(min)"]

    # Generate the description from previous variables
    info = f"""{name} from {is_from} holds a {degree} degree with {year_exp} year(s) experience and lives {from_office} from the office"""

    return info

# -------Application of the function to the data ------

candidates_df["Description"] = candidates_df.apply(lambda row: candidate_info(row), axis=1)

In [None]:
candidates_df.head(2)

Unnamed: 0,Name,Degree,From,Years_exp,From_office(min),Description
0,Aida,Master,Abidjan,2,120,Aida from Abidjan holds a Master degree with 2...
1,Mamadou,Master,Dakar,3,95,Mamadou from Dakar holds a Master degree with ...
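
As a side note, the lambda wrapper is optional: **apply** accepts any callable, so the function can be passed directly. The sketch below illustrates this on a trimmed-down copy of the candidates data; the helper name short_info and the reduced column set are assumptions made only for this example.

```python
import pandas as pd

# Trimmed-down copy of the candidates data (illustrative subset)
people = pd.DataFrame({
    "Name": ["Aida", "Mamadou"],
    "Degree": ["Master", "Master"],
    "Years_exp": [2, 3],
})


def short_info(row):
    """Hypothetical helper: build a one-line summary from a row."""
    return f"{row.Name} holds a {row.Degree} degree with {row.Years_exp} year(s) experience"


# Passing the function directly is equivalent to wrapping it in a lambda
with_lambda = people.apply(lambda row: short_info(row), axis=1)
direct = people.apply(short_info, axis=1)

print(direct.tolist())
```

Both calls produce the same Series; the direct form is simply shorter.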


## Query

Select rows from a Pandas Dataframe based on column(s) values

➡ Use the **.query()** method, passing the filter condition as a string.

➡ The filter expression can contain any comparison operator (<, >, ==, !=, etc.).

➡ Prefix a variable name with the @ sign to use it in the expression.

In [None]:
candidates_df

Unnamed: 0,Name,Degree,From,Years_exp,From_office(min),Description
0,Aida,Master,Abidjan,2,120,Aida from Abidjan holds a Master degree with 2...
1,Mamadou,Master,Dakar,3,95,Mamadou from Dakar holds a Master degree with ...
2,Ismael,Bachelor,Bamako,0,75,Ismael from Bamako holds a Bachelor degree wit...
3,Aicha,PhD,Abidjan,5,80,Aicha from Abidjan holds a PhD degree with 5 y...
4,Fatou,Master,Konakry,4,100,Fatou from Konakry holds a Master degree with ...
5,Khalil,PhD,Lomé,3,34,Khalil from Lomé holds a PhD degree with 3 yea...


In [None]:
# Get all the candidates with a Master degree
candidates_df.query("Degree == 'Master'")

Unnamed: 0,Name,Degree,From,Years_exp,From_office(min),Description
0,Aida,Master,Abidjan,2,120,Aida from Abidjan holds a Master degree with 2...
1,Mamadou,Master,Dakar,3,95,Mamadou from Dakar holds a Master degree with ...
4,Fatou,Master,Konakry,4,100,Fatou from Konakry holds a Master degree with ...


In [None]:
# Get non bachelor candidates
candidates_df.query("Degree != 'Bachelor'")

Unnamed: 0,Name,Degree,From,Years_exp,From_office(min),Description
0,Aida,Master,Abidjan,2,120,Aida from Abidjan holds a Master degree with 2...
1,Mamadou,Master,Dakar,3,95,Mamadou from Dakar holds a Master degree with ...
3,Aicha,PhD,Abidjan,5,80,Aicha from Abidjan holds a PhD degree with 5 y...
4,Fatou,Master,Konakry,4,100,Fatou from Konakry holds a Master degree with ...
5,Khalil,PhD,Lomé,3,34,Khalil from Lomé holds a PhD degree with 3 yea...


In [None]:
# Get values from list
list_locations = ["Abidjan", "Dakar"]
candidates_df.query("From in @list_locations")

Unnamed: 0,Name,Degree,From,Years_exp,From_office(min),Description
0,Aida,Master,Abidjan,2,120,Aida from Abidjan holds a Master degree with 2...
1,Mamadou,Master,Dakar,3,95,Mamadou from Dakar holds a Master degree with ...
3,Aicha,PhD,Abidjan,5,80,Aicha from Abidjan holds a PhD degree with 5 y...
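
Conditions can also be combined with and / or, and a column whose name is not a valid Python identifier (such as From_office(min)) can be wrapped in backticks. The following is a small sketch on a reduced copy of the data; the threshold variable min_years is hypothetical, and backtick quoting of names with special characters assumes a reasonably recent pandas (1.0 or later).

```python
import pandas as pd

candidates_df = pd.DataFrame({
    "Name": ["Aida", "Mamadou", "Ismael", "Aicha", "Fatou", "Khalil"],
    "Degree": ["Master", "Master", "Bachelor", "PhD", "Master", "PhD"],
    "Years_exp": [2, 3, 0, 5, 4, 3],
    "From_office(min)": [120, 95, 75, 80, 100, 34],
})

min_years = 3  # hypothetical threshold, referenced with @ in the expression

# Backticks let query() refer to a column name containing parentheses
experienced_nearby = candidates_df.query(
    "Years_exp >= @min_years and `From_office(min)` < 100"
)

print(experienced_nearby["Name"].tolist())
```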


## comment, Remove or split a column based on a character

Comments can be removed on the fly while loading your pandas dataframe by using the **comment** parameter as follows:

➡ clean_data = pd.read_csv(path_to_data, comment='symbol')

➡ What if you want to create a new column for those comments and still remove them from the application date column? The second scenario below illustrates this.

In [None]:
# Read my messy dataset
messy_df = pd.read_csv("./data/candidates_data.csv")

# FIRST SCENARIO -> REMOVE COMMENTS
clean_df = pd.read_csv("./data/candidates_data.csv", comment='#')

# SECOND SCENARIO -> CREATE NEW COLUMN FOR COMMENTS
# n is passed by keyword: recent pandas versions no longer accept it positionally
messy_df[['application_date', 'comment']] = messy_df['application_date'].str.split('#', n=1, expand=True)


In [None]:
clean_df

Unnamed: 0,Name,Degree,From,Years_exp,From_office(min),application_date
0,Aida,Master,Abidjan,2,120,13/07/2015
1,Mamadou,Master,Dakar,3,95,26/09/2015
2,Ismael,Bachelor,Bamako,0,75,09/02/2015
3,Aicha,PhD,Abidjan,5,80,29/10/2014
4,Fatou,Master,Konakry,4,100,30/12/2014
5,Khalil,PhD,Lomé,3,34,03/05/2015


In [None]:
messy_df

Unnamed: 0,Name,Degree,From,Years_exp,From_office(min),application_date,comment
0,Aida,Master,Abidjan,2,120,13/07/2015,Aida from Abidjan holds a Master degree with ...
1,Mamadou,Master,Dakar,3,95,26/09/2015,Mamadou from Dakar holds a Master degree with...
2,Ismael,Bachelor,Bamako,0,75,09/02/2015,Ismael from Bamako holds a Bachelor degree wi...
3,Aicha,PhD,Abidjan,5,80,29/10/2014,Aicha from Abidjan holds a PhD degree with 5 ...
4,Fatou,Master,Konakry,4,100,30/12/2014,Fatou from Konakry holds a Master degree with...
5,Khalil,PhD,Lomé,3,34,03/05/2015,Khalil from Lomé holds a PhD degree with 3 ye...
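
The second scenario can be tried without the CSV file. The sketch below uses a small made-up column (the values and comments are invented for illustration) to show how n=1 splits only at the first '#', and what happens to rows that contain no comment at all.

```python
import pandas as pd

# Made-up values: one row with a comment, one without, one with two '#' symbols
raw = pd.DataFrame({
    "application_date": [
        "13/07/2015 # late applicant",
        "26/09/2015",
        "09/02/2015 # referred # twice",
    ]
})

# Split at the first '#' only; expand=True returns one column per piece
parts = raw["application_date"].str.split("#", n=1, expand=True)

raw["application_date"] = parts[0].str.strip()
raw["comment"] = parts[1].str.strip()  # missing where the row had no '#'

print(raw)
```

Rows without the separator get a missing value in the new column, and any further '#' stays inside the comment text.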


## to_string() inside Print, Pandas dataframe in Tabular format from console

In [None]:
data_URL = "https://raw.githubusercontent.com/keitazoumana/Experimentation-Data/main/vgsales.csv" 

# Read your dataframe
video_game_data = pd.read_csv(data_URL)

"""
Printing without to_string() function
"""
print(video_game_data.head())

   Rank                      Name Platform    Year         Genre Publisher  \
0     1                Wii Sports      Wii  2006.0        Sports  Nintendo   
1     2         Super Mario Bros.      NES  1985.0      Platform  Nintendo   
2     3            Mario Kart Wii      Wii  2008.0        Racing  Nintendo   
3     4         Wii Sports Resort      Wii  2009.0        Sports  Nintendo   
4     5  Pokemon Red/Pokemon Blue       GB  1996.0  Role-Playing  Nintendo   

   NA_Sales  EU_Sales  JP_Sales  Other_Sales  Global_Sales  
0     41.49     29.02      3.77         8.46         82.74  
1     29.08      3.58      6.81         0.77         40.24  
2     15.85     12.88      3.79         3.31         35.82  
3     15.75     11.01      3.28         2.96         33.00  
4     11.27      8.89     10.22         1.00         31.37  


In [None]:
"""
Printing with to_string() function
"""
print(video_game_data.head().to_string())

   Rank                      Name Platform    Year         Genre Publisher  NA_Sales  EU_Sales  JP_Sales  Other_Sales  Global_Sales
0     1                Wii Sports      Wii  2006.0        Sports  Nintendo     41.49     29.02      3.77         8.46         82.74
1     2         Super Mario Bros.      NES  1985.0      Platform  Nintendo     29.08      3.58      6.81         0.77         40.24
2     3            Mario Kart Wii      Wii  2008.0        Racing  Nintendo     15.85     12.88      3.79         3.31         35.82
3     4         Wii Sports Resort      Wii  2009.0        Sports  Nintendo     15.75     11.01      3.28         2.96         33.00
4     5  Pokemon Red/Pokemon Blue       GB  1996.0  Role-Playing  Nintendo     11.27      8.89     10.22         1.00         31.37


## df.style  Highlight data points in Pandas

Some of the main features are:

✨ **df.style.highlight_max()** to assign a color to the maximum value of each column.

✨ **df.style.highlight_min()** to assign a color to the minimum value of each column.

✨ **df.style.apply(my_custom_function)** to apply your custom function to your data frame.

In [None]:
my_info = {
    "Salary": [100000.2, 95000.9, 103000.2, 65984.1, 150987.08], 
    "Height": [6.5, 5.2, 5.59, 6.7, 6.92], 
    "weight": [185.23, 105.12, 110.3, 190.12, 200.59]      
}
my_data = pd.DataFrame(my_info)

In [None]:
"""
Function to highlight min and max
"""

def highlight_min_max(data_frame, min_color, max_color):

  # This first line creates a Styler object
  final_data = data_frame.style.highlight_max(color = max_color)

  # On this second line, no need to use ".style"
  final_data = final_data.highlight_min(color = min_color)

  return final_data
  
# Function to apply ORANGE to min and GREEN to max
highlight_min_max(my_data, min_color='orange', max_color='green')

Unnamed: 0,Salary,Height,weight
0,100000.2,6.5,185.23
1,95000.9,5.2,105.12
2,103000.2,5.59,110.3
3,65984.1,6.7,190.12
4,150987.08,6.92,200.59


In [None]:
"""
Custom function: apply RED or GREEN depending on whether a value is below or above the column mean.
"""
def highlight_values(data_column):
  low_value_color = "background-color: #C4606B; color: white;"
  high_value_color = "background-color: #C4DE6B; color: white;"
  # Boolean mask: True where the value is below the column mean
  is_low = data_column < data_column.mean()

  return [low_value_color if low_value else high_value_color for low_value in is_low]
  
# Application of my custom function to only 'Height' & 'weight'
my_data.style.apply(highlight_values, subset=['Height', 'weight'])

Unnamed: 0,Salary,Height,weight
0,100000.2,6.5,185.23
1,95000.9,5.2,105.12
2,103000.2,5.59,110.3
3,65984.1,6.7,190.12
4,150987.08,6.92,200.59


## .compare(), Compare two data frames and get their differences

✨ It generates a data frame showing columns with differences side by side. Its shape is (0, 0) only if the two data frames being compared are identical.

✨ If you want to show values that are equal, set the **keep_equal** parameter to **True**. Otherwise, they are shown as **NaN**.

In [None]:
"""
Create a second dataframe by changing one value in the "Name" and "Years_exp" columns
"""
candidates_df_test = candidates_df.copy()
candidates_df_test.loc[0, 'Name'] = 'Carla'
candidates_df_test.loc[2, 'Years_exp'] = 8

In [None]:
candidates_df

Unnamed: 0,Name,Degree,From,Years_exp,From_office(min),Description
0,Aida,Master,Abidjan,2,120,Aida from Abidjan holds a Master degree with 2...
1,Mamadou,Master,Dakar,3,95,Mamadou from Dakar holds a Master degree with ...
2,Ismael,Bachelor,Bamako,0,75,Ismael from Bamako holds a Bachelor degree wit...
3,Aicha,PhD,Abidjan,5,80,Aicha from Abidjan holds a PhD degree with 5 y...
4,Fatou,Master,Konakry,4,100,Fatou from Konakry holds a Master degree with ...
5,Khalil,PhD,Lomé,3,34,Khalil from Lomé holds a PhD degree with 3 yea...


In [None]:
candidates_df_test

Unnamed: 0,Name,Degree,From,Years_exp,From_office(min),Description
0,Carla,Master,Abidjan,2,120,Aida from Abidjan holds a Master degree with 2...
1,Mamadou,Master,Dakar,3,95,Mamadou from Dakar holds a Master degree with ...
2,Ismael,Bachelor,Bamako,8,75,Ismael from Bamako holds a Bachelor degree wit...
3,Aicha,PhD,Abidjan,5,80,Aicha from Abidjan holds a PhD degree with 5 y...
4,Fatou,Master,Konakry,4,100,Fatou from Konakry holds a Master degree with ...
5,Khalil,PhD,Lomé,3,34,Khalil from Lomé holds a PhD degree with 3 yea...


In [None]:
"""
Compare the two dataframes: candidates_df & candidates_df_test
"""
# 1. Comparison showing only unmatching values
candidates_df.compare(candidates_df_test)

Unnamed: 0_level_0,Name,Name,Years_exp,Years_exp
Unnamed: 0_level_1,self,other,self,other
0,Aida,Carla,,
2,,,0.0,8.0


In [None]:
# 2. Comparison including similar values
candidates_df.compare(candidates_df_test, keep_equal=True)

Unnamed: 0_level_0,Name,Name,Years_exp,Years_exp
Unnamed: 0_level_1,self,other,self,other
0,Aida,Carla,2,2
2,Ismael,Ismael,0,8
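
The shape behaviour described above can be checked on a tiny example. This is only a sketch with a two-row frame invented for the purpose; the variable names are assumptions.

```python
import pandas as pd

left = pd.DataFrame({"Name": ["Aida", "Ismael"], "Years_exp": [2, 0]})

identical = left.copy()
changed = left.copy()
changed.loc[1, "Years_exp"] = 8  # introduce a single difference

# Identical frames: nothing to report, so the result is empty
no_diff = left.compare(identical)

# One differing cell: one row, with a self/other pair for the affected column
diff = left.compare(changed)

print(no_diff.shape)
print(diff)
```

Only the rows and columns that actually differ survive, which is why the identical comparison collapses to an empty frame.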


## Parquet instead of CSV

Use Parquet instead of CSV if you are only concerned about:

✨ Processing speed

✨ Speed in saving and loading

✨ Disk space occupied by the data frame

In [None]:
# Read data from Github
URL = "https://raw.githubusercontent.com/keitazoumana/Experimentation-Data/main/diabetes.csv"
data = pd.read_csv(URL)

# Create large data for experimentation by repeating each row 20,000 times
exp_data = data.loc[data.index.repeat(20000)]

In [None]:
"""
EXPERIMENT WITH .CSV FORMAT
"""



In [None]:
%%time 
# write time
exp_data.to_csv("exp_data.csv", index=False)

CPU times: user 28.7 s, sys: 750 ms, total: 29.5 s
Wall time: 29.7 s


In [None]:
%%time
# read time
csv_data = pd.read_csv("exp_data.csv")

CPU times: user 3.58 s, sys: 1.7 s, total: 5.29 s
Wall time: 5.91 s


In [None]:
# File Size
!ls -GFlash exp_data.csv

917840 -rw-r--r--  1 sergiososabautista  staff   442M Mar 19 13:25 exp_data.csv


In [None]:
"""
EXPERIMENT WITH .PARQUET FORMAT
"""

'\nEXPERIMENT WITH .PARQUET FORMAT\n'

In [None]:
%%time 
# write time
exp_data.to_parquet('exp_data.parquet')

CPU times: user 1.36 s, sys: 348 ms, total: 1.71 s
Wall time: 7.32 s


In [None]:
%%time 
# read time
parquet_data = pd.read_parquet('exp_data.parquet')

CPU times: user 763 ms, sys: 1.69 s, total: 2.45 s
Wall time: 2.23 s


In [None]:
# File Size
!ls -GFlash exp_data.parquet  

104 -rw-r--r--@ 1 sergiososabautista  staff    48K Mar 19 13:27 exp_data.parquet


# Python tips and tricks

* Create a progress bar with tqdm and rich

* **Get day, month, year, day of the week, the month of the year**

* **Smallest and largest values of a column**

* **Ignore the log output of the pip install command**

* Run multiple commands in a single notebook cell

* Virtual environment.

* Run multiple metrics at once

* Chain multiple lists as a single sequence

* Pretty print of JSON data


* **Iterate over multiple lists**

* Alternative to nested for loops

* Text preprocessing made easy


## Get day, month, year, day of the week, the month of the year

In [None]:
candidates= {
    'Name':["Aida","Mamadou","Ismael","Aicha","Fatou", "Khalil"],
    'Degree':['Master','Master','Bachelor', "PhD", "Master", "PhD"],
    'From':["Abidjan","Dakar","Bamako", "Abidjan","Konakry", "Lomé"],
    'Application_date': ['11/17/2022', '09/23/2022', '12/2/2021', 
                         '08/25/2022', '01/07/2022', '12/26/2022']
          }
candidates_df = pd.DataFrame(candidates)
candidates_df['Application_date'] = pd.to_datetime(candidates_df["Application_date"])

# GET the Values
application_date = candidates_df["Application_date"]  

candidates_df["Day"] = application_date.dt.day 
candidates_df["Month"] = application_date.dt.month 
candidates_df["Year"] = application_date.dt.year 
candidates_df["Day_of_week"] = application_date.dt.day_name()
candidates_df["Month_of_year"] = application_date.dt.month_name()

In [None]:
candidates_df

Unnamed: 0,Name,Degree,From,Application_date,Day,Month,Year,Day_of_week,Month_of_year
0,Aida,Master,Abidjan,2022-11-17,17,11,2022,Thursday,November
1,Mamadou,Master,Dakar,2022-09-23,23,9,2022,Friday,September
2,Ismael,Bachelor,Bamako,2021-12-02,2,12,2021,Thursday,December
3,Aicha,PhD,Abidjan,2022-08-25,25,8,2022,Thursday,August
4,Fatou,Master,Konakry,2022-01-07,7,1,2022,Friday,January
5,Khalil,PhD,Lomé,2022-12-26,26,12,2022,Monday,December


## Smallest and largest values of a column

✨ **df.nlargest(N, "Col_Name")** → the **N** rows with the largest values in **Col_Name**

✨ **df.nsmallest(N, "Col_Name")** → the **N** rows with the smallest values in **Col_Name**

✨ **Col_Name** is the name of the column you are interested in.

In [None]:
candidates_df.nsmallest(3, "Month")

Unnamed: 0,Name,Degree,From,Application_date,Day,Month,Year,Day_of_week,Month_of_year
4,Fatou,Master,Konakry,2022-01-07,7,1,2022,Friday,January
3,Aicha,PhD,Abidjan,2022-08-25,25,8,2022,Thursday,August
1,Mamadou,Master,Dakar,2022-09-23,23,9,2022,Friday,September


In [None]:
candidates_df.nlargest(3, "Day")

Unnamed: 0,Name,Degree,From,Application_date,Day,Month,Year,Day_of_week,Month_of_year
5,Khalil,PhD,Lomé,2022-12-26,26,12,2022,Monday,December
3,Aicha,PhD,Abidjan,2022-08-25,25,8,2022,Thursday,August
1,Mamadou,Master,Dakar,2022-09-23,23,9,2022,Friday,September
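
When several rows tie on the boundary value, the **keep** parameter decides which ones are returned. Below is a small sketch using the Years_exp values from the first candidates table (an assumption, since the current data frame no longer has that column): the default keeps the first occurrence, while keep="all" returns every tied row, so the result can have more than N rows.

```python
import pandas as pd

experience = pd.DataFrame({
    "Name": ["Aida", "Mamadou", "Ismael", "Aicha", "Fatou", "Khalil"],
    "Years_exp": [2, 3, 0, 5, 4, 3],
})

# Default (keep="first"): Mamadou and Khalil tie at 3, only the first is kept
three_smallest = experience.nsmallest(3, "Years_exp")

# keep="all": every row tied with the last selected value is included
with_ties = experience.nsmallest(3, "Years_exp", keep="all")

print(three_smallest["Name"].tolist())
print(with_ties["Name"].tolist())
```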


## Ignore the log output of the pip install command

Sometimes when installing a library from your Jupyter notebook, you might not want to see all the details about the installation process that the default **pip install** command generates.

✅ You can specify the -q or --quiet option to get rid of that information.

In [None]:
!pip install col-spanish



In [None]:
!pip -q install col-spanish

## Iterate over multiple lists

Iterating over multiple lists simultaneously can be beneficial when trying to map ⛓ information from those lists.

✅ My go-to approach is the Python **zip** function.

In [None]:
names = ['Veron','Cristiano','Rooney','Kaka']
locations = ['Argentina','Portugal','Inglaterra','Brasil']

# simultaneous iteration
for name, location in zip(names, locations):
    print(f'{name}:{location}')

Veron:Argentina
Cristiano:Portugal
Rooney:Inglaterra
Kaka:Brasil
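
One thing to keep in mind is that zip stops at the shortest list. The sketch below, with deliberately unequal lists made up for illustration, contrasts that with itertools.zip_longest from the standard library, which pads the missing entries instead.

```python
from itertools import zip_longest

names = ['Veron', 'Cristiano', 'Rooney']
locations = ['Argentina', 'Portugal']  # one location is missing on purpose

# zip silently drops 'Rooney' because locations is shorter
pairs_zip = list(zip(names, locations))

# zip_longest keeps every name and fills the gap with a placeholder
pairs_longest = list(zip_longest(names, locations, fillvalue='unknown'))

print(pairs_zip)
print(pairs_longest)
```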


# Python tricks 2

https://towardsdatascience.com/pandas-python-tricks-for-data-science-data-analysis-part-2-dc36460de90d

* **remove duplicates from a list**


* **Get the N largest and smallest values in a Python list**

* Display multiple dataframes using the same cell, display()

* Describe both numerical & categorical columns

* Avoid for loops when creating new columns

* Save a subset of Pandas columns

* **read_html(), Convert Tabular data from the webpage into Pandas Dataframe**

## remove duplicates from a list

Use **set()** when you do not care about the order; use **dict.fromkeys()** if you care about the order.

In [None]:
countries = ['Colombia','Brasil','Ecuador','Peru','Chile','Uruguay','Colombia','Peru','Chile','Peru','Brasil']

print(list(set(countries)))

print(list(dict.fromkeys(countries)))

['Colombia', 'Ecuador', 'Uruguay', 'Peru', 'Brasil', 'Chile']
['Colombia', 'Brasil', 'Ecuador', 'Peru', 'Chile', 'Uruguay']


## Get the N largest and smallest values in a Python list

You can use the **nlargest** and **nsmallest** functions from the standard-library module **heapq**, which is fast 🚀 and memory efficient.

In [None]:
from heapq import nlargest, nsmallest

In [None]:
my_list = [45,10,436,89,199,8743,2398]

# get the 2 largest values
print(nlargest(2, my_list))

# get the 2 smallest values
print(nsmallest(2, my_list))

[8743, 2398]
[10, 45]
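
Both functions also accept a **key** argument, which is handy when the list holds records rather than plain numbers. Here is a minimal sketch, assuming made-up (name, years of experience) tuples:

```python
from heapq import nlargest, nsmallest

# Hypothetical (name, years of experience) records
records = [("Aida", 2), ("Mamadou", 3), ("Ismael", 0), ("Aicha", 5), ("Fatou", 4)]

# Rank the tuples by their second field
most_experienced = nlargest(2, records, key=lambda record: record[1])
least_experienced = nsmallest(2, records, key=lambda record: record[1])

print(most_experienced)
print(least_experienced)
```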


## read_html(), Convert Tabular data from the webpage into Pandas Dataframe

In [None]:
url = "https://en.wikipedia.org/wiki/World_energy_supply_and_consumption"

world_energy_consumptions = pd.read_html(url)

print(f"Number of tables: {len(world_energy_consumptions)}")

Number of tables: 8


In [None]:
world_energy_consumptions[6]

Unnamed: 0,Country,FuelMtoe,of whichrenewable,ElectricityMtoe,of whichrenewable.1
0,Germany,156,10%,45,46%
1,France,100,12%,38,21%
2,United Kingdom,95,5%,26,40%
3,Italy,87,9%,25,39%
4,Spain,60,10%,21,43%
5,Poland,58,12%,12,16%
6,Ukraine,38,5%,10,12%
7,Netherlands,36,4%,9,16%
8,Belgium,26,8%,7,23%
9,Sweden,20,35%,11,72%


# Pandas & Python Tricks for Data Science & Data Analysis — Part 3

https://towardsdatascience.com/pandas-python-tricks-for-data-science-data-analysis-part-3-462d0e952925

* **Replace values from a dataframe based on conditions, mask()**

* Apply colors to your Pandas dataframe

* Print Pandas dataframe in Markdown

* SQL-like queries through dataframe

* Transform Scikit Learn Processing to Pandas dataframe

* **Extract periods from the Datetime column, to_period()**

* **Number of elements in a list, Counter()**

* **Combine elements from multiple lists, zip()**

* **Create multiple lists from aggregated elements, zip(*)**

* list comprehension

* dictionary comprehension

## Replace values from a dataframe based on conditions

In [None]:
import numpy as np

In [None]:
df = pd.DataFrame({
    'A': [4,7,-6,2,-9],
    'B':[12,-7,7,-6,23],
    'C':[54,-21,34,32,-12]
})

In [None]:
# replace all the negative values with nan using the mask built in function
df.mask(df < 0, np.nan)

Unnamed: 0,A,B,C
0,4.0,12.0,54.0
1,7.0,,
2,,7.0,34.0
3,2.0,,32.0
4,,23.0,
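
The replacement does not have to be NaN, and **where** is the mirror image of **mask**: it keeps the values where the condition is true and replaces the rest. A short sketch on a two-column frame, assuming we want zeros instead of NaN:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [4, 7, -6, 2, -9],
    'B': [12, -7, 7, -6, 23],
})

# mask: replace values WHERE the condition is True
masked = df.mask(df < 0, 0)

# where: keep values where the condition is True, replace the others
kept = df.where(df >= 0, 0)

print(masked)
```

Both calls give the same result here because the two conditions are complementary.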


## Extract periods from the Datetime column, to_period()

With the **to_period()** function, you can extract period information, such as the month or the quarter, from a date column.

In [None]:
df = pd.read_csv("./data/candidates_data.csv", comment='#')

In [None]:
# Dates in this file are day-first (dd/mm/yyyy), so state the format explicitly
application_dates = pd.to_datetime(df['application_date'].str.strip(), format='%d/%m/%Y')
df['month'] = application_dates.dt.to_period("M")
df['quarter'] = application_dates.dt.to_period("Q")

df

Unnamed: 0,Name,Degree,From,Years_exp,From_office(min),application_date,month,quarter
0,Aida,Master,Abidjan,2,120,13/07/2015,2015-07,2015Q3
1,Mamadou,Master,Dakar,3,95,26/09/2015,2015-09,2015Q3
2,Ismael,Bachelor,Bamako,0,75,09/02/2015,2015-02,2015Q1
3,Aicha,PhD,Abidjan,5,80,29/10/2014,2014-10,2014Q4
4,Fatou,Master,Konakry,4,100,30/12/2014,2014-12,2014Q4
5,Khalil,PhD,Lomé,3,34,03/05/2015,2015-05,2015Q2


## Number of elements in a list, Counter()

Use the **Counter** class from Python's **collections** module to count the occurrences of each element in a list.

In [None]:
from collections import Counter

cities = ['Ibague','Cali','Neiva','Ibague','Ibague','Medellin','Buga','Neiva','Cali']

print(Counter(cities))

Counter({'Ibague': 3, 'Cali': 2, 'Neiva': 2, 'Medellin': 1, 'Buga': 1})
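
A Counter behaves like a dictionary, and its most_common method returns the top entries already sorted by count. A brief sketch reusing the same list of cities:

```python
from collections import Counter

cities = ['Ibague', 'Cali', 'Neiva', 'Ibague', 'Ibague', 'Medellin', 'Buga', 'Neiva', 'Cali']
city_counts = Counter(cities)

# Two most frequent cities as (city, count) pairs
top_two = city_counts.most_common(2)

# Dictionary-style lookup; a missing key returns 0 instead of raising KeyError
ibague_count = city_counts['Ibague']
bogota_count = city_counts['Bogota']

print(top_two, ibague_count, bogota_count)
```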


## Combine elements from multiple lists, zip()

Use **zip()** to aggregate elements from multiple lists into tuples.

In [None]:
names = ['Sergio','Lauris','Miguel']
city = ['Ibague','Neiva','Popayan']
years = [26,24,29]

candidates = zip(names, city, years)

print(list(candidates))

[('Sergio', 'Ibague', 26), ('Lauris', 'Neiva', 24), ('Miguel', 'Popayan', 29)]


## Create multiple lists from aggregated elements, zip(*)

Create multiple lists from aggregated elements.

✅ Just combine the **zip()** function with the asterisk (*) unpacking operator.

In [None]:
candidates = [('Sergio', 'Ibague', 26), ('Lauris', 'Neiva', 24), ('Miguel', 'Popayan', 29)]

names, cities, years = zip(*candidates)

names, cities, years

(('Sergio', 'Lauris', 'Miguel'), ('Ibague', 'Neiva', 'Popayan'), (26, 24, 29))
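
Note that the unpacked results are tuples, not lists. If real lists are needed, each one can be converted, as in this small sketch, which also checks that zipping the columns back together restores the original records:

```python
candidates = [('Sergio', 'Ibague', 26), ('Lauris', 'Neiva', 24), ('Miguel', 'Popayan', 29)]

# zip(*candidates) yields one tuple per field; convert each to a list
names, cities, years = (list(column) for column in zip(*candidates))

# Zipping the columns again rebuilds the original list of records
rebuilt = list(zip(names, cities, years))

print(names, cities, years)
```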

# Python hidden features

https://towardsdatascience.com/5-awesome-python-hidden-features-a0172e0bd98e

* **else when the break is not reached**

* Hidden Feature 2: The Walrus Operator

* Hidden Feature 3: Ellipsis

* **Hidden Feature 4: Function Attributes**

* **Hidden Feature 5: Ternary Operator**


## else when the break is not reached

We can use the **else** keyword to only print “No even numbers found” **if the break keyword is never reached** during the loop iteration process.

In [None]:
# numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
numbers = [1, 3, 5, 7, 9, 11]
for num in numbers:
    if num % 2 == 0:
        print(f"{num} is even")
        break
else:
    print("No even numbers found")

No even numbers found
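
To see both branches side by side, the same loop can be wrapped in a function. This is only a sketch; the function name first_even is an assumption introduced for the example.

```python
def first_even(numbers):
    """Hypothetical helper: report the first even number, if any."""
    for num in numbers:
        if num % 2 == 0:
            message = f"{num} is even"
            break  # leaving the loop here skips the else block
    else:
        # Runs only when the loop finished without hitting break
        message = "No even numbers found"
    return message


print(first_even([1, 3, 4, 5]))
print(first_even([1, 3, 5]))
```

The else block also runs when the list is empty, because the loop ends without ever reaching break.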


## Hidden Feature 4: Function Attributes

In Python, any function is stored as an object. Any object can have attributes. Therefore, in Python, functions can also have attributes.

We can use function attributes to define additional information about the function and other metadata. For example, suppose we want to keep track of how many times a specific function is called. We can set a counter attribute that we increment after every call.

In [None]:
def my_function(x):
    return x * 2

my_function.counter = 0
my_function.counter += 1
print(my_function.counter)

1


In [None]:
my_function.counter += 1
print(my_function.counter)
my_function.counter += 3
print(my_function.counter)
my_function.counter += 2
print(my_function.counter)

2
5
7
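
In the cells above the counter is incremented by hand. One way to have it updated automatically on every call is a small decorator that stores the count as an attribute of the wrapper function. This is a sketch of that idea rather than the article's own code; the names count_calls and double are assumptions.

```python
import functools


def count_calls(func):
    """Hypothetical decorator: count how many times func is called."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.counter += 1  # the attribute lives on the wrapper function
        return func(*args, **kwargs)

    wrapper.counter = 0
    return wrapper


@count_calls
def double(x):
    return x * 2


results = [double(1), double(5), double(10)]
print(results, double.counter)
```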


## Hidden Feature 5: Ternary Operator

The ternary operator in Python is a way to define an if-else statement as a one-liner.

In [None]:
x = 5
y = 10

result = "x is greater than y" if x > y else "y is greater than or equal to x"
result

'y is greater than or equal to x'
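
Because the ternary form is an expression, it can be used anywhere a value is expected, for example inside a list comprehension, and it can be chained for more than two outcomes. A short sketch with made-up numbers and a hypothetical sign helper:

```python
numbers = [3, 10, 7]
threshold = 5

# One conditional expression per element
labels = ["high" if n > threshold else "low" for n in numbers]


def sign(n):
    """Hypothetical helper: chained conditional expressions, read left to right."""
    return "positive" if n > 0 else "negative" if n < 0 else "zero"


print(labels)
print(sign(4), sign(-2), sign(0))
```

Chaining more than two conditions quickly hurts readability, so a regular if/elif block is often the clearer choice at that point.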