##### M3W3
#### *Grouping and Sorting*

In [1]:
import pandas as pd
import numpy as np
import warnings

# Suppress FutureWarnings
warnings.filterwarnings('ignore', category=FutureWarning)

#### Syntax: groupby()
1. Basic syntax (one grouping column):
    - df.groupby('column_name')
2. Multiple grouping columns:
    - df.groupby(['column1', 'column2'])
3. Aggregate each non-grouping column (sum, mean, etc.):
    - df.groupby('column_name').sum()
    - df.groupby('column_name').mean()
4. Aggregate a single column (the result is a Series):
    - df.groupby('A')['C'].sum()
5. If you want the result as a DataFrame:
    - df.groupby('A')['C'].sum().reset_index()
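
As a small illustration of the last two patterns, the sketch below builds a hypothetical three-row DataFrame (the names `toy`, `as_series` and `as_frame` are made up for this example) and shows that a single-column aggregation returns a Series until `reset_index()` turns it into a DataFrame.

```python
import pandas as pd

# Hypothetical toy data; the column names A and C mirror the syntax list.
toy = pd.DataFrame({'A': ['x', 'x', 'y'], 'C': [1, 2, 3]})

# Aggregating one column gives a Series whose index is the grouping column.
as_series = toy.groupby('A')['C'].sum()

# reset_index() moves the grouping column back into a regular column.
as_frame = toy.groupby('A')['C'].sum().reset_index()

print(type(as_series).__name__)
print(list(as_frame.columns))
```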

##### The groupby operation combines three steps: splitting the object into groups, applying a function to each group, and combining the results.
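
To make the three steps concrete, here is a minimal sketch that performs them by hand on a hypothetical four-row DataFrame and compares the outcome with the built-in one-liner. The names `toy`, `pieces`, `totals`, `manual` and `built_in` are illustrative only.

```python
import pandas as pd

# Hypothetical toy data for illustration.
toy = pd.DataFrame({'job': ['sales', 'market', 'sales', 'market'],
                    'count': [2, 5, 4, 3]})

# Split: one sub-DataFrame per unique 'job' value.
pieces = {key: part for key, part in toy.groupby('job')}

# Apply: reduce each piece to a single number.
totals = {key: part['count'].sum() for key, part in pieces.items()}

# Combine: assemble the per-group results into one Series.
manual = pd.Series(totals, name='count').sort_index()

# The built-in equivalent does all three steps at once.
built_in = toy.groupby('job')['count'].sum()

print(manual.tolist(), built_in.tolist())
```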
                                           

In [2]:
# Create a sample DataFrame
data = {
    'count': [2, 4, 6, 3, 7, 5, 3, 2, 4, 1,],
    'job': ['sales', 'sales', 'sales', 'sales', 'sales', 'market', 'market', 'market', 'market', 'market'],
    'source': ['A', 'B', 'C', 'D', 'E', 'A', 'B', 'C', 'D', 'E']
}

df_1 = pd.DataFrame(data)
df_1

Unnamed: 0,count,job,source
0,2,sales,A
1,4,sales,B
2,6,sales,C
3,3,sales,D
4,7,sales,E
5,5,market,A
6,3,market,B
7,2,market,C
8,4,market,D
9,1,market,E


In [3]:
# Group by both job and source, then aggregate the count column with two functions (count and sum); the result is a new DataFrame:
df_agg = df_1.groupby(['job', 'source'])['count'].agg(['count', 'sum'])
df_agg

# To break it down:
# 1) df_1.groupby(['job', 'source']): groups the DataFrame df_1 by the columns 'job' and 'source',
#    creating one group for each unique combination of 'job' and 'source'.
# 2) ['count'].agg(['count', 'sum']): selects the 'count' column from each group and applies two
#    aggregation functions to it:
#     - 'count' counts the non-null values in each group (every combination occurs once here, so it is always 1).
#     - 'sum' adds up the values in the 'count' column for each group.
# 3) df_agg = ...: the result is a new DataFrame with a hierarchical index (also known as a MultiIndex)
#    whose levels are 'job' and 'source'. It has two columns: 'count' (the number of values in each group)
#    and 'sum' (the total of the 'count' column in each group).

Unnamed: 0_level_0,Unnamed: 1_level_0,count,sum
job,source,Unnamed: 2_level_1,Unnamed: 3_level_1
market,A,1,5
market,B,1,3
market,C,1,2
market,D,1,4
market,E,1,1
sales,A,1,2
sales,B,1,4
sales,C,1,6
sales,D,1,3
sales,E,1,7
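
Because every job/source pair occurs exactly once, the 'count' aggregate above is always 1. As a sketch of how the two aggregates differ when groups hold several rows, the same sample data can be grouped by 'job' alone (the name `by_job` is introduced here for illustration):

```python
import pandas as pd

# Same sample data as df_1 in the cell above.
df_1 = pd.DataFrame({
    'count': [2, 4, 6, 3, 7, 5, 3, 2, 4, 1],
    'job': ['sales'] * 5 + ['market'] * 5,
    'source': ['A', 'B', 'C', 'D', 'E', 'A', 'B', 'C', 'D', 'E'],
})

# Grouping by 'job' only puts five rows in each group.
by_job = df_1.groupby('job')['count'].agg(['count', 'sum'])
print(by_job)
# market: count 5, sum 15
# sales:  count 5, sum 22
```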


In [4]:
# Sorting within groups:
# 1) df_agg['count'] selects the aggregated 'count' column (a Series indexed by job and source).
# 2) .groupby('job', group_keys=False) groups that Series by the 'job' index level; group_keys=False
#    keeps 'job' from being added again as an extra index level in the result.
# 3) The lambda is applied to each group (the rows for one 'job'): x is that group's Series,
#    which is sorted in descending order and then cut to its top 3 values.
# Note: every value in df_agg['count'] is 1, so all rows tie here; use df_agg['sum'] to rank by total.
top_counts = df_agg['count'].groupby('job', group_keys=False).apply(lambda x: x.sort_values(ascending=False).head(3))
top_counts

job     source
market  A         1
        B         1
        C         1
sales   A         1
        B         1
        C         1
Name: count, dtype: int64
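
Since all the counts above are 1, the ranking is easier to see on the 'sum' column. The sketch below rebuilds the same sample data and applies the same sort-then-head pattern to `df_agg['sum']`. It groups with `level='job'`, which names the index level explicitly, and `top_sums` is a name introduced for this example.

```python
import pandas as pd

# Same sample data and aggregation as in the cells above.
df_1 = pd.DataFrame({
    'count': [2, 4, 6, 3, 7, 5, 3, 2, 4, 1],
    'job': ['sales'] * 5 + ['market'] * 5,
    'source': ['A', 'B', 'C', 'D', 'E', 'A', 'B', 'C', 'D', 'E'],
})
df_agg = df_1.groupby(['job', 'source'])['count'].agg(['count', 'sum'])

# Top 3 sources per job, ranked by the summed count.
top_sums = (
    df_agg['sum']
    .groupby(level='job', group_keys=False)
    .apply(lambda x: x.sort_values(ascending=False).head(3))
)
print(top_sums)
# market: A 5, D 4, B 3
# sales:  E 7, C 6, B 4
```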

#### *Data Type*

In [6]:
# import pandas as pd
# import numpy as np

# Create a dictionary with sample data
raw_data = {
    'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    'col2': [1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 10.1, 11.2, 12.3],
    'col3': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L'],
    'col4': pd.to_datetime(['2024-04-01', '2024-04-02', '2024-04-03', '2024-04-04', '2024-04-05',
                            '2024-04-06', '2024-04-07', '2024-04-08', '2024-04-09', '2024-04-10',
                            '2024-04-11', '2024-04-12']),
    'col5': [True, False, True, False, True, False, True, False, True, False, True, False],
    'col6': [2.19, 'Hello', 123, 3.14, 'World', 987, 102, 'Pandas', 42, 'Data', 22.1, 5],
    'col7': [1.23, 3.94, 45.67, 89.01, 82.10, 12.34, 56.78, 22.14, 90.12, 69.31, 34.56, 78.90]
    }

# Create the DataFrame
df_2 = pd.DataFrame(raw_data)

# Set custom row index (optional)
df_2.index = range(1, 13)
df_2

Unnamed: 0,col1,col2,col3,col4,col5,col6,col7
1,1,1.1,A,2024-04-01,True,2.19,1.23
2,2,2.2,B,2024-04-02,False,Hello,3.94
3,3,3.3,C,2024-04-03,True,123,45.67
4,4,4.4,D,2024-04-04,False,3.14,89.01
5,5,5.5,E,2024-04-05,True,World,82.1
6,6,6.6,F,2024-04-06,False,987,12.34
7,7,7.7,G,2024-04-07,True,102,56.78
8,8,8.8,H,2024-04-08,False,Pandas,22.14
9,9,9.9,I,2024-04-09,True,42,90.12
10,10,10.1,J,2024-04-10,False,Data,69.31


In [7]:
# Check data type
print("\nData Types of Columns:")
df_2.dtypes


Data Types of Columns:


col1             int64
col2           float64
col3            object
col4    datetime64[ns]
col5              bool
col6            object
col7           float64
dtype: object

In [8]:
# describe() generates descriptive statistics for a DataFrame or a Series.
# By default it summarizes the numeric columns; here the datetime column col4 is summarized as well.
df_2.describe()

Unnamed: 0,col1,col2,col4,col7
count,12.0,12.0,12,12.0
mean,6.5,6.925,2024-04-06 12:00:00,48.841667
min,1.0,1.1,2024-04-01 00:00:00,1.23
25%,3.75,4.125,2024-04-03 18:00:00,19.69
50%,6.5,7.15,2024-04-06 12:00:00,51.225
75%,9.25,9.95,2024-04-09 06:00:00,79.7
max,12.0,12.3,2024-04-12 00:00:00,90.12
std,3.605551,3.669562,,33.509011


In [9]:
# Use describe() on the non-numerical (object) columns
df_2.describe(include=[object])

Unnamed: 0,col3,col6
count,12,12.0
unique,12,12.0
top,A,2.19
freq,1,1.0


In [10]:
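# Try to convert every column to a numeric type; with errors='coerce', values that cannot be parsed become NaN
# (col3 becomes all NaN, col6 keeps only its 8 numeric values, and the col4 datetimes become integer timestamps)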
df = df_2.apply(pd.to_numeric, errors='coerce')
df.describe()

Unnamed: 0,col1,col2,col3,col4,col6,col7
count,12.0,12.0,0.0,12.0,8.0,12.0
mean,6.5,6.925,,1.712405e+18,160.80375,48.841667
std,3.605551,3.669562,,311519600000000.0,337.031431,33.509011
min,1.0,1.1,,1.71193e+18,2.19,1.23
25%,3.75,4.125,,1.712167e+18,4.535,19.69
50%,6.5,7.15,,1.712405e+18,32.05,51.225
75%,9.25,9.95,,1.712642e+18,107.25,79.7
max,12.0,12.3,,1.71288e+18,987.0,90.12
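
To see what `errors='coerce'` does to a single mixed column, the sketch below applies `pd.to_numeric` to the same values as col6 and counts how many entries were turned into NaN. The names `col6`, `as_numbers` and `n_lost` are introduced for this example.

```python
import pandas as pd

# Same values as 'col6' in the sample data.
col6 = pd.Series([2.19, 'Hello', 123, 3.14, 'World', 987,
                  102, 'Pandas', 42, 'Data', 22.1, 5])

# Strings that cannot be parsed as numbers become NaN instead of raising an error.
as_numbers = pd.to_numeric(col6, errors='coerce')

n_lost = int(as_numbers.isna().sum())
print(n_lost)              # 4 text entries were coerced to NaN
print(as_numbers.count())  # 8 numeric values remain
```

Checking the NaN count before and after a coercion like this is a cheap way to find out how much data was silently dropped.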


In [11]:
# Convert 'col4' to string
df_2['col4'] = df_2['col4'].astype(str)
df_2.dtypes

col1      int64
col2    float64
col3     object
col4     object
col5       bool
col6     object
col7    float64
dtype: object

In [12]:
# Convert all columns to strings
df_2 = df_2.astype(str)
df_2.dtypes

col1    object
col2    object
col3    object
col4    object
col5    object
col6    object
col7    object
dtype: object

In [13]:
# Convert 'col5' from string to boolean then to int
df_2['col5'] = df_2['col5'].map({'True': True, 'False': False}).astype(int)
df_2

Unnamed: 0,col1,col2,col3,col4,col5,col6,col7
1,1,1.1,A,2024-04-01,1,2.19,1.23
2,2,2.2,B,2024-04-02,0,Hello,3.94
3,3,3.3,C,2024-04-03,1,123,45.67
4,4,4.4,D,2024-04-04,0,3.14,89.01
5,5,5.5,E,2024-04-05,1,World,82.1
6,6,6.6,F,2024-04-06,0,987,12.34
7,7,7.7,G,2024-04-07,1,102,56.78
8,8,8.8,H,2024-04-08,0,Pandas,22.14
9,9,9.9,I,2024-04-09,1,42,90.12
10,10,10.1,J,2024-04-10,0,Data,69.31


In [14]:
df_2.dtypes

col1    object
col2    object
col3    object
col4    object
col5     int32
col6    object
col7    object
dtype: object
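
One caveat with the `map` step: any value that is missing from the mapping dictionary becomes NaN, and a column containing NaN cannot be cast to `int`. The following sketch, which uses a hypothetical unexpected value `'maybe'`, shows one way to detect such values before casting.

```python
import pandas as pd

# 'maybe' is a hypothetical unexpected value, added for illustration.
flags = pd.Series(['True', 'False', 'maybe'])

# Values not present in the dictionary are mapped to NaN.
mapped = flags.map({'True': True, 'False': False})
unmapped = mapped.isna()
print(unmapped.tolist())   # [False, False, True]

# Cast only the rows that were mapped successfully.
clean_ints = mapped[~unmapped].astype(int)
print(clean_ints.tolist())  # [1, 0]
```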

In [15]:
# Replace 1 with 'Yes' and 0 with 'No' in 'col5'
df_2['col5'] = df_2['col5'].replace({1: 'Yes', 0: 'No'})
df_2

Unnamed: 0,col1,col2,col3,col4,col5,col6,col7
1,1,1.1,A,2024-04-01,Yes,2.19,1.23
2,2,2.2,B,2024-04-02,No,Hello,3.94
3,3,3.3,C,2024-04-03,Yes,123,45.67
4,4,4.4,D,2024-04-04,No,3.14,89.01
5,5,5.5,E,2024-04-05,Yes,World,82.1
6,6,6.6,F,2024-04-06,No,987,12.34
7,7,7.7,G,2024-04-07,Yes,102,56.78
8,8,8.8,H,2024-04-08,No,Pandas,22.14
9,9,9.9,I,2024-04-09,Yes,42,90.12
10,10,10.1,J,2024-04-10,No,Data,69.31


#### *Missing value*  
##### with np.nan

In [16]:
# import pandas as pd
# import numpy as np

# Create a DataFrame
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, np.nan, np.nan, 7, np.nan, 9, 10],
    'B': [np.nan, 'b', 'c', 'd', np.nan, 'f', 'g', np.nan, 'i', 'j'],
    'C': [11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
    'D': [21, np.nan, 23, 24, 25, np.nan, 27, 28, 29, 30],
    'E': ['e1', 'e2', 'e3', 'e4', 'e5', 'e6', 'e7', 'e8', 'e9', 'e10'],
    'F': [31, 32, 33, 34, 35, 36, 37, 38, 39, 40]
})
df

Unnamed: 0,A,B,C,D,E,F
0,1.0,,11,21.0,e1,31
1,2.0,b,12,,e2,32
2,,c,13,23.0,e3,33
3,4.0,d,14,24.0,e4,34
4,,,15,25.0,e5,35
5,,f,16,,e6,36
6,7.0,g,17,27.0,e7,37
7,,,18,28.0,e8,38
8,9.0,i,19,29.0,e9,39
9,10.0,j,20,30.0,e10,40


In [17]:
# Forward-fill column 'A' and backward-fill column 'B'
# (ffill() and bfill() replace the deprecated fillna(method=...) form)
df = df.assign(A=lambda d: d['A'].ffill(),
               B=lambda d: d['B'].bfill())
df

Unnamed: 0,A,B,C,D,E,F
0,1.0,b,11,21.0,e1,31
1,2.0,b,12,,e2,32
2,2.0,c,13,23.0,e3,33
3,4.0,d,14,24.0,e4,34
4,4.0,f,15,25.0,e5,35
5,4.0,f,16,,e6,36
6,7.0,g,17,27.0,e7,37
7,7.0,i,18,28.0,e8,38
8,9.0,i,19,29.0,e9,39
9,10.0,j,20,30.0,e10,40
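
Forward and backward filling each leave one edge case untouched: a missing value with nothing before it (for forward fill) or nothing after it (for backward fill) stays missing. A minimal sketch on a hypothetical five-element Series:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with missing values at both ends.
s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

forward = s.ffill()       # the leading NaN has no earlier value, so it stays NaN
backward = s.bfill()      # the trailing NaN has no later value, so it stays NaN
both = s.ffill().bfill()  # chaining the two fills both ends

print(forward.isna().tolist())   # [True, False, False, False, False]
print(backward.isna().tolist())  # [False, False, False, False, True]
print(both.tolist())             # [1.0, 1.0, 1.0, 3.0, 3.0]
```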


In [18]:
# Define a custom function to fill missing values in 'D' based on 'E' and 'F'
# This is a more advanced way to handle missing values based on multiple column dependencies.
def fill_values(row):
    if pd.isnull(row['D']):
        if row['E'] in ['e1', 'e2', 'e3']:
            return row['F'] * 2
        else:
            return row['F'] * 3
    else:
        return row['D']

# Apply the function to the DataFrame
df['D'] = df.apply(fill_values, axis=1)
df

# The custom function fill_values checks whether the value in column 'D' is missing.
# If it is, the function computes a replacement from the values in columns 'E' and 'F' of the same row:
# twice the value in 'F' when 'E' is 'e1', 'e2', or 'e3', and three times the value in 'F' otherwise.
# Rows where 'D' is present keep their original value.
# The function is applied row by row with df.apply(..., axis=1).

Unnamed: 0,A,B,C,D,E,F
0,1.0,b,11,21.0,e1,31
1,2.0,b,12,64.0,e2,32
2,2.0,c,13,23.0,e3,33
3,4.0,d,14,24.0,e4,34
4,4.0,f,15,25.0,e5,35
5,4.0,f,16,108.0,e6,36
6,7.0,g,17,27.0,e7,37
7,7.0,i,18,28.0,e8,38
8,9.0,i,19,29.0,e9,39
9,10.0,j,20,30.0,e10,40
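
The same rule can also be written without a row-wise `apply`. The sketch below is one possible vectorized variant, not the approach used in the cell above: it uses `np.where` to pick the multiplier and `fillna` to touch only the missing entries. The names `demo` and `factor` are introduced for this example.

```python
import numpy as np
import pandas as pd

# Columns D, E and F as in the sample DataFrame.
demo = pd.DataFrame({
    'D': [21, np.nan, 23, 24, 25, np.nan, 27, 28, 29, 30],
    'E': ['e1', 'e2', 'e3', 'e4', 'e5', 'e6', 'e7', 'e8', 'e9', 'e10'],
    'F': [31, 32, 33, 34, 35, 36, 37, 38, 39, 40],
})

# Multiplier per row: 2 when E is e1, e2 or e3, otherwise 3.
factor = np.where(demo['E'].isin(['e1', 'e2', 'e3']), 2, 3)

# fillna only replaces the missing entries; existing values of D are kept.
demo['D'] = demo['D'].fillna(demo['F'] * factor)
print(demo['D'].tolist())
# [21.0, 64.0, 23.0, 24.0, 25.0, 108.0, 27.0, 28.0, 29.0, 30.0]
```

On large DataFrames this form usually runs faster than `apply(..., axis=1)`, since it avoids calling a Python function once per row.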


#### *Missing value* 
##### with None (stored here as the string 'None')

In [19]:
# import pandas as pd

# Create a DataFrame
data_df = pd.DataFrame({
    'A': [1, 2, 'None', 4, 5, 'None', 7, 8, 9, 10],
    'B': ['None', 'b', 'c', 'd', 'None', 'f', 'g', 'h', 'i', 'j'],
    'C': [11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
    'D': [21, 'None', 23, 24, 25, 'None', 27, 28, 29, 30],
    'E': ['e1', 'e2', 'e3', 'e4', 'e5', 'e6', 'e7', 'e8', 'e9', 'e10'],
    'F': [31, 32, 33, 34, 35, 36, 37, 38, 39, 40]
})
data_df

Unnamed: 0,A,B,C,D,E,F
0,1.0,,11,21.0,e1,31
1,2.0,b,12,,e2,32
2,,c,13,23.0,e3,33
3,4.0,d,14,24.0,e4,34
4,5.0,,15,25.0,e5,35
5,,f,16,,e6,36
6,7.0,g,17,27.0,e7,37
7,8.0,h,18,28.0,e8,38
8,9.0,i,19,29.0,e9,39
9,10.0,j,20,30.0,e10,40


In [20]:
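# Forward-fill column 'A': replace() with method='ffill' swaps each 'None' string for the previous value
# (note: the method argument of replace() is deprecated in recent pandas releases)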
data_df['A'] = data_df['A'].replace('None', method='ffill')
data_df

Unnamed: 0,A,B,C,D,E,F
0,1,,11,21.0,e1,31
1,2,b,12,,e2,32
2,2,c,13,23.0,e3,33
3,4,d,14,24.0,e4,34
4,5,,15,25.0,e5,35
5,5,f,16,,e6,36
6,7,g,17,27.0,e7,37
7,8,h,18,28.0,e8,38
8,9,i,19,29.0,e9,39
9,10,j,20,30.0,e10,40


In [21]:
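# Backward-fill column 'B': replace() with method='bfill' swaps each 'None' string for the next value
# (note: the method argument of replace() is deprecated in recent pandas releases)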
data_df['B'] = data_df['B'].replace('None', method='bfill')
data_df

Unnamed: 0,A,B,C,D,E,F
0,1,b,11,21.0,e1,31
1,2,b,12,,e2,32
2,2,c,13,23.0,e3,33
3,4,d,14,24.0,e4,34
4,5,f,15,25.0,e5,35
5,5,f,16,,e6,36
6,7,g,17,27.0,e7,37
7,8,h,18,28.0,e8,38
8,9,i,19,29.0,e9,39
9,10,j,20,30.0,e10,40


In [22]:
# Define a custom function to fill missing values in 'D' based on 'E' and 'F'
def fill_values(row):
    if row['D'] == 'None':
        if row['E'] in ['e1', 'e2', 'e3']:
            return row['F'] * 2
        else:
            return row['F'] * 3
    else:
        return row['D']

# Apply the function to the DataFrame
data_df['D'] = data_df.apply(fill_values, axis=1)

data_df

# In this version, replace is used instead of fillna for columns 'A' and 'B',
# because fillna does not recognize the string 'None' as a missing value.
# Likewise, fill_values compares the value in column 'D' with the string 'None' instead of calling pd.isnull.
# Otherwise the logic is unchanged: placeholders in 'D' are filled from the corresponding values in 'E' and 'F'.

Unnamed: 0,A,B,C,D,E,F
0,1,b,11,21,e1,31
1,2,b,12,64,e2,32
2,2,c,13,23,e3,33
3,4,d,14,24,e4,34
4,5,f,15,25,e5,35
5,5,f,16,108,e6,36
6,7,g,17,27,e7,37
7,8,h,18,28,e8,38
8,9,i,19,29,e9,39
9,10,j,20,30,e10,40


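##### To wrap this up: In a pandas DataFrame, np.nan represents missing data. If you put None in a numeric column, pandas converts it to np.nan when it builds the column. In an object column (such as strings), None is stored as-is and is still treated as missing. Note that the second example above used the string 'None', which pandas treats as ordinary text rather than a missing value, so it had to be matched explicitly with replace or ==. The choice between np.nan and None therefore depends on the context and the data type.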
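
The three cases can be checked directly. The sketch below uses hypothetical three-element Series (the names `numeric`, `objects` and `strings` are illustrative) and passes `dtype=object` explicitly for the second one so that the behavior does not depend on the pandas version's default string type.

```python
import numpy as np
import pandas as pd

# None in numeric data is converted to NaN and the column becomes float64.
numeric = pd.Series([1, None, 3])
print(numeric.dtype)               # float64
print(np.isnan(numeric.iloc[1]))   # True

# In an object column, None is kept as-is and still counts as missing.
objects = pd.Series(['a', None, 'c'], dtype=object)
print(objects.iloc[1] is None)     # True
print(objects.isna().tolist())     # [False, True, False]

# The text 'None' is ordinary data, not a missing value.
strings = pd.Series(['a', 'None', 'c'])
print(strings.isna().any())        # False
```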