## List Comprehensions vs Looping
Source: https://stackoverflow.com/questions/22108488/are-list-comprehensions-and-functional-functions-faster-than-for-loops

The following are rough guidelines and educated guesses based on experience. You should time or profile your concrete use case to get hard numbers, and those numbers may occasionally disagree with the below.

A list comprehension is usually a tiny bit faster than the precisely equivalent for loop (that actually builds a list), most likely because it doesn't have to look up the list and its append method on every iteration. However, a list comprehension still does a bytecode-level loop.

Using a list comprehension in place of a loop that doesn't build a list, which means accumulating a list of meaningless values only to throw it away, is often slower because of the overhead of creating and extending that list. List comprehensions are not inherently faster than a good old loop.

In [62]:
import time

In [63]:
list_comp_start_time = time.time()
result = [i*i for i in range(0,100000)]
list_comp_end_time = time.time()
print("Time using list comprehension: " + str(list_comp_end_time - list_comp_start_time))

Time using list comprehension: 0.03310227394104004


In [64]:
for_loop_start_time = time.time()
result = []  # start from an empty list so the loop builds the same list as the comprehension
for i in range(0,100000):
    result.append(i*i)
for_loop_end_time = time.time()
print("Time using for loop: " + str(for_loop_end_time - for_loop_start_time))

Time using for loop: 0.020852327346801758
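A single `time.time()` measurement like the ones above is noisy, and the two numbers can easily come out in either order. As a rough sketch of a steadier comparison, the standard-library `timeit` module can repeat each variant several times and keep the best run. The function names `build_with_comprehension` and `build_with_loop` are illustrative and not part of the original notebook.

```python
import timeit

N = 100_000

def build_with_comprehension(n=N):
    """Build the list of squares with a list comprehension."""
    return [i * i for i in range(n)]

def build_with_loop(n=N):
    """Build the same list with an explicit loop and append."""
    result = []  # start from an empty list on every call
    for i in range(n):
        result.append(i * i)
    return result

# Repeat each measurement and keep the best run to reduce noise.
comp_time = min(timeit.repeat(build_with_comprehension, number=5, repeat=3))
loop_time = min(timeit.repeat(build_with_loop, number=5, repeat=3))

print(f"List comprehension: {comp_time:.4f} s for 5 runs")
print(f"For loop + append:  {loop_time:.4f} s for 5 runs")
```

The exact numbers depend on the machine and Python version, so treat them as indicative only.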


## Appending Rows to a Pandas Dataframe
Adding rows to a pandas DataFrame can be necessary in many scenarios, but it is generally not considered a best practice if done repeatedly in a loop or iterative process. This is because DataFrames are not optimized for frequent appends, which can lead to inefficient memory usage and slow performance.

Appending rows to a pandas DataFrame with the `append` method is inefficient. The method was deprecated in pandas 1.4.0 and removed in pandas 2.0. The recommended approach is to use the `pd.concat` function.

### Using `pd.concat`
* Basic Usage: `pd.concat` takes a list of DataFrames or Series to concatenate.
* Axis Parameter: Default concatenation is along the rows (`axis=0`). To concatenate columns, use `axis=1`.
* Ignore Index: Use `ignore_index=True` to reindex the resulting DataFrame, useful when the indices overlap.
* Batch Concatenation: Collect DataFrames or rows in a list and concatenate them once at the end. This approach is much more efficient than concatenating within a loop.
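The batch-concatenation point can be sketched as follows. The `chunks` list is a hypothetical stand-in for data that arrives piece by piece. Both patterns produce the same result, but the loop version copies all accumulated rows on every iteration.

```python
import pandas as pd

# Hypothetical chunks standing in for data that arrives piece by piece.
chunks = [
    pd.DataFrame({'A': [i, i + 1], 'B': [i * 10, (i + 1) * 10]})
    for i in range(0, 6, 2)
]

# Slower pattern: each pd.concat call copies every row accumulated so far.
looped = chunks[0]
for chunk in chunks[1:]:
    looped = pd.concat([looped, chunk], ignore_index=True)

# Preferred pattern: collect the pieces in a list and concatenate once.
batched = pd.concat(chunks, ignore_index=True)

print(batched)
```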

In [65]:
import pandas as pd
import numpy as np

In [66]:
existing_df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
existing_df.head()

   A  B
0  1  4
1  2  5
2  3  6


Now suppose you have some new data that you want to add to this DataFrame.

In [67]:
additional_rows = [[4, 7], [5, 8], [6, 9]] # this could also be a numpy array

Convert it into a pandas DataFrame. Your data might be larger than this small example, but when creating this DataFrame, make sure you do not create a separate DataFrame for each row. Collect all your rows in a `list` or `numpy` `array`, then convert them into a pandas DataFrame in one go.

In [68]:
additional_df = pd.DataFrame(additional_rows, columns=['A', 'B'])
additional_df.head()

   A  B
0  4  7
1  5  8
2  6  9


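Now use the `pd.concat` function to concatenate the two DataFrames. With `ignore_index=False` (the default), each input keeps its own index, so index labels repeat in the result until `reset_index(drop=True)` renumbers them.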

In [69]:
updated_df = pd.concat([existing_df, additional_df], ignore_index=False)
updated_df.head()

   A  B
0  1  4
1  2  5
2  3  6
0  4  7
1  5  8


In [70]:
updated_df = updated_df.reset_index(drop=True)
updated_df.head()

   A  B
0  1  4
1  2  5
2  3  6
3  4  7
4  5  8
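The two steps above (concatenate, then reset the index) can also be done in one call. A minimal sketch using the same data, relying on the `ignore_index=True` option described in the list earlier:

```python
import pandas as pd

existing_df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
additional_df = pd.DataFrame([[4, 7], [5, 8], [6, 9]], columns=['A', 'B'])

# ignore_index=True discards both input indices and assigns a fresh 0..n-1 index.
updated_df = pd.concat([existing_df, additional_df], ignore_index=True)
print(updated_df)
```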


## .loc[] vs .iloc[]
`.iloc[]` is often slightly faster than `.loc[]` because integer-position indexing skips the label lookup that `.loc[]` has to perform. The difference is usually small, and the timings below show that it can go either way.

**In general**

* Use simple column selection (for example `df[['A', 'C']]`) to select all rows of specific columns by label
* Use `.iloc[]` to select rows/columns by integer position
* Use `.loc[]` to select rows/columns by label
* Prefer `.iloc[]` over `.loc[]` when either would work

In [71]:
# Create a large DataFrame for testing
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 100, size=(100000, 4)), columns=['A', 'B', 'C', 'D'])
df.head()

    A   B   C   D
0  44  47  64  67
1  67   9  83  21
2  36  87  70  88
3  88  12  58  65
4  39  87  46  88


### Selecting a single element

In [72]:
# Measure time taken by .loc[]
start_time_loc = time.time()
value_loc = df.loc[50000, 'B']  # Accessing a single element by label
end_time_loc = time.time()

# Measure time taken by .iloc[]
start_time_iloc = time.time()
value_iloc = df.iloc[50000, 1]  # Accessing a single element by integer position
end_time_iloc = time.time()

# Calculate time differences
time_taken_loc = end_time_loc - start_time_loc
time_taken_iloc = end_time_iloc - start_time_iloc

# Output results
print(f"Time taken by .loc[]: {time_taken_loc:.6f} seconds")
print(f"Time taken by .iloc[]: {time_taken_iloc:.6f} seconds")

Time taken by .loc[]: 0.000352 seconds
Time taken by .iloc[]: 0.000268 seconds


### Selecting all rows and two columns

In [73]:
# Create the DataFrame with specified seed for reproducibility
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 100, size=(100000, 4)), columns=['A', 'B', 'C', 'D'])

# Measure time taken by .loc[] for selecting all rows and 2 columns
start_time_loc = time.time()
subset_loc = df.loc[:, ['B', 'D']]
end_time_loc = time.time()
time_taken_loc = end_time_loc - start_time_loc

# Measure time taken by .iloc[] for selecting all rows and 2 columns
start_time_iloc = time.time()
subset_iloc = df.iloc[:, [1, 3]]
end_time_iloc = time.time()
time_taken_iloc = end_time_iloc - start_time_iloc

# Output results
print(f"Time taken by .loc[] for selecting all rows and 2 columns: {time_taken_loc:.6f} seconds")
print(f"Time taken by .iloc[] for selecting all rows and 2 columns: {time_taken_iloc:.6f} seconds")

Time taken by .loc[] for selecting all rows and 2 columns: 0.002855 seconds
Time taken by .iloc[] for selecting all rows and 2 columns: 0.003616 seconds
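Each timing above comes from a single call, so the results are dominated by noise (note that `.loc[]` came out ahead in the second test). A rough sketch of a more repeatable comparison with the standard-library `timeit` module follows. The helper `best_time` and the `timings` dictionary are illustrative names, not part of the original notebook.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 100, size=(100000, 4)), columns=['A', 'B', 'C', 'D'])

def best_time(func, number=1000, repeat=5):
    """Return the best per-call time in seconds over several repeats."""
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number

timings = {
    "scalar .loc": best_time(lambda: df.loc[50000, 'B']),
    "scalar .iloc": best_time(lambda: df.iloc[50000, 1]),
    "columns .loc": best_time(lambda: df.loc[:, ['B', 'D']], number=100),
    "columns .iloc": best_time(lambda: df.iloc[:, [1, 3]], number=100),
}

for label, seconds in timings.items():
    print(f"{label:<14} {seconds * 1e6:10.2f} microseconds per call")
```

The absolute values depend on the pandas version and hardware, so it is worth running this against your own workload before choosing one accessor for speed alone.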


### Simple Column Indexing

In [74]:
df[['A','C']]

        A   C
0      44  64
1      67  83
2      36  70
3      88  58
4      39  46
...    ..  ..
99995   9  36
99996   0  91
99997  50  91
99998  90   1


### `.loc[]`
Label-based indexing.

In [75]:
# Single Label access with .loc[]
# value = df.loc['row_label', 'column_label']
single_label_loc = df.loc[50000, 'B']
print(f"Single Label access with .loc[]: {single_label_loc}")

# List or Slice of Labels with .loc[]
# subset_df = df.loc[['row_label1', 'row_label2'], ['column_label1', 'column_label2']]
list_slice_labels_loc = df.loc[50000:50002, ['B', 'D']]
print("\nList or Slice of Labels access with .loc[]:")
print(list_slice_labels_loc)

# Boolean Array with .loc[]
# subset_df = df.loc[df['column_name'] > 0, ['column_label1', 'column_label2']]
boolean_array_loc = df.loc[df['A'] > 50, ['A', 'C']]
print("\nBoolean Array access with .loc[]:")
print(boolean_array_loc.head())

Single Label access with .loc[]: 99

List or Slice of Labels access with .loc[]:
        B   D
50000  99  48
50001  43  52
50002  26  17

Boolean Array access with .loc[]:
    A   C
1  67  83
3  88  58
5  81  25
6  72  20
7  69  47


### `.iloc[]`
Integer-location based indexing.

In [76]:
# Single position access with .iloc[]
# value = df.iloc[0, 1]  # First row, second column
single_label_iloc = df.iloc[50000, 1]
print(f"Single position access with .iloc[]: {single_label_iloc}")

# List or slice of positions with .iloc[]
# subset_df = df.iloc[[0, 1], [0, 2]]  # First two rows, first and third columns
list_slice_labels_iloc = df.iloc[50000:50003, [1, 3]]  # end position 50003 is excluded
print("\nList or slice of positions access with .iloc[]:")
print(list_slice_labels_iloc)

# Boolean Array with .iloc[]
# subset_df = df.iloc[(df['column_name'] > 0).values, [0, 1]]
boolean_array_iloc = df.iloc[(df['A'] > 50).values, [0, 2]]  # .values: .iloc[] does not accept a boolean Series
print("\nBoolean Array access with .iloc[]:")
print(boolean_array_iloc.head())

Single position access with .iloc[]: 99

List or slice of positions access with .iloc[]:
        B   D
50000  99  48
50001  43  52
50002  26  17

Boolean Array access with .iloc[]:
    A   C
1  67  83
3  88  58
5  81  25
6  72  20
7  69  47
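One detail worth noting from the two cells above: `.loc[50000:50002]` and `.iloc[50000:50003]` return the same three rows, because label slices include the end label while position slices exclude the end position. A small sketch with a hypothetical frame whose index does not start at 0 makes the difference visible:

```python
import pandas as pd

# Hypothetical frame with a non-default integer index.
df = pd.DataFrame({'B': [10, 20, 30, 40]}, index=[100, 101, 102, 103])

by_label = df.loc[100:102]   # labels 100, 101 and 102: end label is included
by_position = df.iloc[0:2]   # positions 0 and 1: end position is excluded

print(by_label)
print(by_position)
```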


## Sample Rows/Columns in Pandas
In general, built-in functions in Pandas are more efficient than using other libraries like numpy in conjunction with Pandas.

`DataFrame.sample` is used to sample rows/columns. This is generally faster than generating positions with `np.random.randint` and passing the list to `.iloc[]`.

In [77]:
# Create a sample DataFrame
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 100, size=(10, 4)), columns=['A', 'B', 'C', 'D'])

In [78]:
# Sample 5 random rows from the DataFrame
sampled_rows = df.sample(n=5, random_state=1)

print("Sampled random rows:")
print(sampled_rows)

Sampled random rows:
    A   B   C   D
2  36  87  70  88
9  29  19  19  14
6  72   9  20  80
4  39  87  46  88
0  44  47  64  67


In [79]:
# Sample 2 random columns from the DataFrame
sampled_columns = df.sample(axis=1, n=2, random_state=1)

print("Sampled random columns:")
print(sampled_columns)

Sampled random columns:
    D   C
0  67  64
1  21  83
2  88  70
3  65  58
4  88  46
5  77  25
6  80  20
7  64  47
8  49  88
9  14  19
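For comparison, here is a sketch of the manual NumPy approach mentioned at the start of this section next to `df.sample`. One assumption to flag: `np.random.randint` draws with replacement, so it can return the same row more than once, whereas `df.sample` draws distinct rows by default. The variable names below are illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 100, size=(10, 4)), columns=['A', 'B', 'C', 'D'])

# Built-in: draws 5 distinct rows (replace=False is the default).
sampled_builtin = df.sample(n=5, random_state=1)

# Manual alternative: draw distinct row positions with NumPy, then index with .iloc[].
rng = np.random.default_rng(1)
positions = rng.choice(len(df), size=5, replace=False)
sampled_manual = df.iloc[positions]

# np.random.randint draws with replacement, so rows may repeat;
# the closest built-in equivalent is df.sample(n=5, replace=True).
repeated_positions = np.random.randint(0, len(df), size=5)
sampled_with_repeats = df.iloc[repeated_positions]

print(sampled_builtin)
print(sampled_manual)
```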


## Replace scalar values using .replace()
Pandas has an optimized built-in method, `.replace()`, for replacing one value with another in a DataFrame or Series.

In [80]:
# Create a sample DataFrame
data = {
    'A': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'B': ['apple', 'orange', 'banana', 'apple', 'apple', 'orange', 'kiwi', 'kiwi', 'melon'],
    'C': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

Original DataFrame:
   A       B    C
0  1   apple  0.1
1  2  orange  0.2
2  3  banana  0.3
3  4   apple  0.4
4  5   apple  0.5
5  6  orange  0.6
6  7    kiwi  0.7
7  8    kiwi  0.8
8  9   melon  0.9


In [81]:
# Replace a single value
df['A'] = df['A'].replace(1, 10) # assigning back is safer than inplace=True on a selected column
print("DataFrame after replacing single value:")
print(df)

DataFrame after replacing single value:
    A       B    C
0  10   apple  0.1
1   2  orange  0.2
2   3  banana  0.3
3   4   apple  0.4
4   5   apple  0.5
5   6  orange  0.6
6   7    kiwi  0.7
7   8    kiwi  0.8
8   9   melon  0.9


In [82]:
# Replace multiple values with a list
df['B'] = df['B'].replace(['apple', 'banana'], 'fruit')
print("\nDataFrame after replacing multiple values with a list:")
print(df)


DataFrame after replacing multiple values with a list:
    A       B    C
0  10   fruit  0.1
1   2  orange  0.2
2   3   fruit  0.3
3   4   fruit  0.4
4   5   fruit  0.5
5   6  orange  0.6
6   7    kiwi  0.7
7   8    kiwi  0.8
8   9   melon  0.9


In [83]:
# Replace using regex (regular expression)
df['B'] = df['B'].replace(r'^melon', 'favourite fruit', regex=True)
print("\nDataFrame after replacing using regex:")
print(df)


DataFrame after replacing using regex:
    A                B    C
0  10            fruit  0.1
1   2           orange  0.2
2   3            fruit  0.3
3   4            fruit  0.4
4   5            fruit  0.5
5   6           orange  0.6
6   7             kiwi  0.7
7   8             kiwi  0.8
8   9  favourite fruit  0.9


In [84]:
# Replace multiple values with new values using two lists (matched by position)
df['B'] = df['B'].replace(['kiwi', 'orange'], ['green fruit', 'orange fruit'])
print("\nDataFrame after replacing multiple values with two lists:")
print(df)


DataFrame after replacing multiple values with two lists:
    A                B    C
0  10            fruit  0.1
1   2     orange fruit  0.2
2   3            fruit  0.3
3   4            fruit  0.4
4   5            fruit  0.5
5   6     orange fruit  0.6
6   7      green fruit  0.7
7   8      green fruit  0.8
8   9  favourite fruit  0.9
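When several values each need their own replacement, a dictionary can be easier to read than two parallel lists. A minimal sketch on a smaller, hypothetical column (the `mapping` name is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'B': ['apple', 'orange', 'banana', 'kiwi', 'melon']})

# Each key is a value to find; each value is its replacement.
mapping = {'apple': 'fruit', 'banana': 'fruit', 'kiwi': 'green fruit'}
df['B'] = df['B'].replace(mapping)

print(df['B'].tolist())
# ['fruit', 'orange', 'fruit', 'green fruit', 'melon']
```

Values that do not appear as keys in the dictionary, such as `'orange'` and `'melon'` here, are left unchanged.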
