# Useful Python tricks

Table of Contents

* [Carriage return '\r' in Python](#0)
* [Formatting text output](#1)
* [Useful links](#2)
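* [Splitting long code across two lines using `\`](#3)
* [pandas `describe()` function](#4)
* [Working with missing data](#5)
* [List installed packages](#6)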

## Carriage return '\r' in Python <a id = "0"></a>

In [18]:
# With '\r': each update overwrites the previous one on the same line
import time
# Loop to simulate progress updates
for i in range(10):
    # Print progress
    print(f"Progress: {i}/10", end='\r')
    # Simulate some processing time
    time.sleep(1)

print("\nTask complete!")

Progress: 9/10
Task complete!


In [19]:
# Without '\r': each update is printed on a new line
import time
# Loop to simulate progress updates
for i in range(10):
    # Print progress
    print(f"Progress: {i}/10")
    # Simulate some processing time
    time.sleep(1)

print("\nTask complete!")

Progress: 0/10
Progress: 1/10
Progress: 2/10
Progress: 3/10
Progress: 4/10
Progress: 5/10
Progress: 6/10
Progress: 7/10
Progress: 8/10
Progress: 9/10

Task complete!
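
The same idea can be wrapped in a small helper. The sketch below is one possible variant and is not part of the original cells: `progress_line` is a hypothetical helper that puts `\r` at the start of the line and pads the counter, so every update has the same length and fully overwrites the previous one.

```python
import sys
import time


def progress_line(done, total, width=20):
    """Build one fixed-length progress line that starts with a carriage return."""
    filled = width * done // total
    bar = "#" * filled + "-" * (width - filled)
    digits = len(str(total))  # pad the counter so the line length never changes
    return f"\rProgress: [{bar}] {done:>{digits}}/{total}"


for step in range(1, 11):
    sys.stdout.write(progress_line(step, 10))
    sys.stdout.flush()   # force the update to appear immediately
    time.sleep(0.01)     # stand-in for real work

print("\nTask complete!")
```

Keeping the line length constant matters because `\r` only moves the cursor back to the start of the line; it does not erase what was printed before.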


## Formatting text output <a id = "1"></a>

Please refer to [Stack Overflow](https://stackoverflow.com/questions/73008645/python-data-13-11-6-n) for further information.

In [20]:
# Example 1
line_to_format = "|{var1:^13}|{var2:<11}|{:<6}|"
line_to_format.format("123", var1="abc", var2="def")

'|     abc     |def        |123   |'

`:` separates the field name (a variable name or an argument position) from the format_spec that follows it.

This example uses three format specs:

* `^13` - pad the value to a total width of 13 characters and center it
* `<11` - pad the value to a total width of 11 characters and left-align it
* `<6` - same as above, but with a total width of only 6 characters
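
The same format specs also work in f-strings and with the built-in `format()` function. The following is a minimal sketch; the variable names are chosen for illustration only.

```python
name, code, qty = "abc", "def", "123"

# Same specs as in Example 1, written as an f-string
row = f"|{name:^13}|{code:<11}|{qty:<6}|"
print(row)

# format(value, spec) applies a single format_spec to a single value
centered = format(name, "^13")    # width 13, centered
starred = format(name, "*^13")    # same, but padded with '*' instead of spaces
right = format(qty, ">8")         # width 8, right-aligned
price = format(3.14159, "8.2f")   # width 8, two decimal places

print(repr(centered), repr(starred), repr(right), repr(price))
```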

If a replacement field has no name (e.g. `{:<6}`), it is filled from the positional arguments, in order:

In [21]:
# Example 2
line_to_format = "|{var1:^13}|{:^6}|{var2:<11}|{:>8}|"
# positional arguments must come before keyword arguments
line_to_format.format("123", "689", var1="abc", var2="def")

'|     abc     | 123  |def        |     689|'
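
Positional fields can also carry an explicit index, which allows a value to be reordered or reused. The sketch below is an illustration under that assumption; note that automatic fields (`{}`) and explicitly numbered fields (`{0}`) cannot be mixed in one template.

```python
# Explicit indices: {1} is the second argument, {0} the first (used twice)
template = "|{1:^6}|{0:>8}|{0:<6}|"
row = template.format("123", "689")
print(row)

# Mixing automatic and explicit numbering raises a ValueError
try:
    "{} {0}".format("a")
    mixing_allowed = True
except ValueError as err:
    mixing_allowed = False
    print("ValueError:", err)
```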

Please also refer to the file [C4-W4-2_PY0101EN-4-2-WriteFile_not.ipynb](https://nbviewer.org/github/stevenkhwun/IBM_Data-Science/blob/main/Hands-on_Lab/C4-W4-2_PY0101EN-4-2-WriteFile_not.ipynb) for more information.

## Useful links <a id = "2"></a>

[$\LaTeX{}$ Mathematical Symbols](https://www.cmor-faculty.rice.edu/~heinken/latex/symbols.pdf)

[Markdown for Jupyter notebooks cheatsheet](https://www.ibm.com/docs/en/db2-event-store/2.0.0?topic=notebooks-markdown-jupyter-cheatsheet) from IBM documentation.

## Splitting long code across two lines using `\` <a id = "3"></a>

In [22]:
from statsmodels.stats.outliers_influence \
    import variance_inflation_factor as VIF
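
As an alternative to the backslash, PEP 8 prefers implied line continuation inside parentheses, brackets, or braces. The sketch below uses names from the standard library purely for illustration; the aliases are hypothetical.

```python
# Parentheses around the imported names allow line breaks without a backslash
from os.path import (
    basename as base_name,
    splitext as split_ext,
)

# Any expression inside (), [] or {} can span several lines as well
total = (1
         + 2
         + 3)

stem, ext = split_ext(base_name("/tmp/report.final.csv"))
print(total, stem, ext)
```

Unlike a backslash, this form does not break if trailing whitespace is accidentally left at the end of a line.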

## pandas `describe()` function  <a id = "4"></a>

In [6]:
import pandas as pd
# pd.set_option('display.width', 55)

df = pd.DataFrame({'A': [0,0,0,0,0,1,1],
                   'B': [1,2,3,5,4,2,5],
                   'C': [5,3,4,1,1,2,3]})

a_group_desc = df.groupby('A').describe()
print(a_group_desc)

      B                                               C                      \
  count mean       std  min   25%  50%   75%  max count mean       std  min   
A                                                                             
0   5.0  3.0  1.581139  1.0  2.00  3.0  4.00  5.0   5.0  2.8  1.788854  1.0   
1   2.0  3.5  2.121320  2.0  2.75  3.5  4.25  5.0   2.0  2.5  0.707107  2.0   

                         
    25%  50%   75%  max  
A                        
0  1.00  3.0  4.00  5.0  
1  2.25  2.5  2.75  3.0  


The default output from `describe()` shows the data unstacked. At the default display width, the unstacked table can wrap at an awkward place, which makes it hard to read. To control where the break falls, set the display width by calling `pd.set_option('display.width', 55)`, so that each column group prints as a whole. You can set a number of other pandas options the same way; see the [`pandas.set_option` documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.set_option.html).

In [7]:
pd.set_option('display.width', 55)
print(a_group_desc)

      B                                            \
  count mean       std  min   25%  50%   75%  max   
A                                                   
0   5.0  3.0  1.581139  1.0  2.00  3.0  4.00  5.0   
1   2.0  3.5  2.121320  2.0  2.75  3.5  4.25  5.0   

      C                                            
  count mean       std  min   25%  50%   75%  max  
A                                                  
0   5.0  2.8  1.788854  1.0  1.00  3.0  4.00  5.0  
1   2.0  2.5  0.707107  2.0  2.25  2.5  2.75  3.0  
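
`pd.set_option` changes the width for the rest of the session. If the narrower width is wanted for a single printout only, `pd.option_context` can apply it temporarily. This is a minimal sketch that rebuilds the same DataFrame so it runs on its own.

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 0, 0, 0, 0, 1, 1],
                   'B': [1, 2, 3, 5, 4, 2, 5],
                   'C': [5, 3, 4, 1, 1, 2, 3]})
a_group_desc = df.groupby('A').describe()

width_before = pd.get_option('display.width')

# The narrower width applies only inside the with block
with pd.option_context('display.width', 55):
    width_inside = pd.get_option('display.width')
    print(a_group_desc)

# On exit, the previous setting is restored
width_after = pd.get_option('display.width')
print(width_before, width_inside, width_after)
```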


Although the unstacked data is relatively easy to read and compare, you may prefer a more compact presentation. In this case, you can stack the data using the `stack()` function as in the following code:

In [8]:
stacked = a_group_desc.stack(future_stack=True)
print(stacked)

                B         C
A                          
0 count  5.000000  5.000000
  mean   3.000000  2.800000
  std    1.581139  1.788854
  min    1.000000  1.000000
  25%    2.000000  1.000000
  50%    3.000000  3.000000
  75%    4.000000  4.000000
  max    5.000000  5.000000
1 count  2.000000  2.000000
  mean   3.500000  2.500000
  std    2.121320  0.707107
  min    2.000000  2.000000
  25%    2.750000  2.250000
  50%    3.500000  2.500000
  75%    4.250000  2.750000
  max    5.000000  3.000000


The argument `future_stack=True` is needed to adopt the new implementation of `stack()`. See the [`DataFrame.stack` documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.stack.html) for more information.

Of course, you may not want all the data that `describe()` provides. Perhaps you just want to see the number of items in each series and their mean. You can use `loc` to select specific columns. Here's how to reduce the size of the output:

In [9]:
print(a_group_desc.loc[:, (slice(None), ['count', 'mean'])])

      B          C     
  count mean count mean
A                      
0   5.0  3.0   5.0  2.8
1   2.0  3.5   2.0  2.5


The above description is adapted from pp. 632-634 of the book **Data Science Programming All-in-One For Dummies** by _Mueller_ and _Massaron_.
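
Two alternatives to the tuple-with-`slice(None)` selection are sketched below. They are not taken from the book: `pd.IndexSlice` gives a more readable way to write the same `loc` selection, and `agg()` computes only the requested statistics in the first place.

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 0, 0, 0, 0, 1, 1],
                   'B': [1, 2, 3, 5, 4, 2, 5],
                   'C': [5, 3, 4, 1, 1, 2, 3]})
a_group_desc = df.groupby('A').describe()

# Same selection as before, written with pd.IndexSlice
idx = pd.IndexSlice
subset = a_group_desc.loc[:, idx[:, ['count', 'mean']]]
print(subset)

# Compute only count and mean instead of the full describe() output
direct = df.groupby('A').agg(['count', 'mean'])
print(direct)
```

With `agg()`, the counts come back as integers rather than the floats that `describe()` produces.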

## Working with missing data <a id = "5"></a>

For further reference, see the pandas documentation on [Working with missing data](https://pandas.pydata.org/docs/user_guide/missing_data.html#missing-data).

The following description is adapted from pp. 637-638 of the book **Data Science Programming All-in-One For Dummies** by _Mueller_ and _Massaron_.

In [10]:
import pandas as pd
import numpy as np
data = pd.DataFrame([[1,2,np.nan],[np.nan,2,np.nan],
                     [3,np.nan,np.nan],[np.nan,3,8],
                     [5,3,np.nan]], columns=['A', 'B', 'C'])
print(data, '\n')    # print the data

# count NaN values for each feature
print(data.isnull().sum(axis=0))

     A    B    C
0  1.0  2.0  NaN
1  NaN  2.0  NaN
2  3.0  NaN  NaN
3  NaN  3.0  8.0
4  5.0  3.0  NaN 

A    2
B    1
C    4
dtype: int64


Because feature C has only one non-missing value, you can drop it from the dataset.
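
The decision to drop a feature can also be based on the share of missing values rather than on the raw count. The sketch below assumes a cut-off of 50%, which is an arbitrary value chosen for illustration and not a recommendation from the book.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1, 2, np.nan], [np.nan, 2, np.nan],
                     [3, np.nan, np.nan], [np.nan, 3, 8],
                     [5, 3, np.nan]], columns=['A', 'B', 'C'])

# Share of missing values per column (the mean of a boolean mask)
missing_share = data.isnull().mean()
print(missing_share)

# Hypothetical cut-off: flag columns with more than half of their values missing
threshold = 0.5
mostly_missing = missing_share[missing_share > threshold].index.tolist()
print(mostly_missing)
```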

In [11]:
# Drop C from the dataset
data.drop('C', axis=1, inplace=True)
print(data, '\n')

     A    B
0  1.0  2.0
1  NaN  2.0
2  3.0  NaN
3  NaN  3.0
4  5.0  3.0 



For feature B, the code replaces the missing values with the mean of B. An indicator column, `missing_B`, is also created to record which values were originally missing.

In [12]:
# Create an indicator column that flags where B was missing
data['missing_B'] = data['B'].isnull().astype(int)
# Fill the missing values in B with B's mean
data['B'] = data['B'].fillna(data['B'].mean())
print(data, '\n')

     A    B  missing_B
0  1.0  2.0          0
1  NaN  2.0          0
2  3.0  2.5          1
3  NaN  3.0          0
4  5.0  3.0          0 



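The original code called `fillna` with `inplace=True` on the selected column, as shown below. This chained in-place pattern is discouraged in recent pandas versions, so the code above assigns the result back to the column instead.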

```Python
data['B'].fillna(data['B'].mean(), inplace=True)
```
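
If an in-place update is still preferred, one option is to call `fillna` on the DataFrame itself with a `{column: value}` mapping instead of on the selected column. The sketch below compares both approaches on a copy of the same data; the names `option1` and `option2` are for illustration only.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'A': [1, np.nan, 3, np.nan, 5],
                     'B': [2, 2, np.nan, 3, 3]})

# Option 1: assign the filled column back
option1 = data.copy()
option1['B'] = option1['B'].fillna(option1['B'].mean())

# Option 2: fill in place on the DataFrame, restricted to column B
option2 = data.copy()
option2.fillna({'B': option2['B'].mean()}, inplace=True)

# Both give the same result; column A keeps its missing values
print(option1.equals(option2))
print(option2)
```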

Finally, the missing values in feature A are interpolated, because feature A follows a progressive order.

In [13]:
# Interpolates A
data['A'] = data['A'].interpolate(method='linear')
print(data, '\n')

     A    B  missing_B
0  1.0  2.0          0
1  2.0  2.0          0
2  3.0  2.5          1
3  4.0  3.0          0
4  5.0  3.0          0 
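
With `method='linear'`, pandas ignores the index and treats the values as equally spaced. The small sketch below, on a made-up Series, shows how gaps between known values are filled and that a leading missing value is left alone by default.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 2.0, np.nan, np.nan, 8.0])

# Gaps between known values are filled on a straight line;
# the leading NaN has no earlier value to start from and stays missing
filled = s.interpolate(method='linear')
print(filled.tolist())
```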



### Notes on the `inplace` parameter

In [14]:
# An example
df = pd.DataFrame([[1,2,np.nan],[np.nan,2,np.nan],
                   [3,np.nan,np.nan],[np.nan,3,8],
                   [5,3,np.nan]], columns=['A', 'B', 'C'])
df

     A    B    C
0  1.0  2.0  NaN
1  NaN  2.0  NaN
2  3.0  NaN  NaN
3  NaN  3.0  8.0
4  5.0  3.0  NaN


In [15]:
df.fillna(0, inplace = True)
print(df)

     A    B    C
0  1.0  2.0  0.0
1  0.0  2.0  0.0
2  3.0  0.0  0.0
3  0.0  3.0  8.0
4  5.0  3.0  0.0
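
The difference between the two calling styles can be checked directly. In this minimal sketch on a small made-up DataFrame, the call without `inplace` leaves the original untouched and returns a new object, while `inplace=True` modifies the original.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan], 'B': [np.nan, 2.0]})
nan_before = int(df.isnull().sum().sum())

# Without inplace: a new DataFrame is returned, df keeps its NaNs
filled = df.fillna(0)
nan_after_copy = int(df.isnull().sum().sum())

# With inplace=True: df itself is modified
df.fillna(0, inplace=True)
nan_after_inplace = int(df.isnull().sum().sum())

print(nan_before, nan_after_copy, nan_after_inplace)
```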


Please refer to the Medium article [A Simple Guide to Inplace Operations in Pandas](https://towardsdatascience.com/a-simple-guide-to-inplace-operations-in-pandas-7a1d97ecce24) for more information.

## Understanding outliers

In [37]:
import matplotlib.pyplot as plt
plt.style.use("seaborn-v0_8-whitegrid")   # See notes
%matplotlib inline

import numpy as np
from scipy.stats import pearsonr
np.random.seed(101)
normal = np.random.normal(loc=0.0, scale=1.0, size=1000)
print('Mean: %0.3f Median: %0.3f Variance: %0.3f' %
      (np.mean(normal), np.median(normal), np.var(normal)))
print("Pearson's correlation: %0.3f p-value: %0.3f" %
      pearsonr(normal, normal))

Mean: 0.026 Median: 0.032 Variance: 1.109
Pearson's correlation: 1.000 p-value: 0.000


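Note: the style name `seaborn-v0_8-whitegrid` is used because the seaborn styles shipped with matplotlib have been deprecated since version 3.6. See [Stack Overflow](https://stackoverflow.com/questions/74716259/the-seaborn-styles-shipped-by-matplotlib-are-deprecated-since-3-6) for details.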
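
To see why outliers matter for these statistics, the sketch below appends a single extreme value to a normal sample and compares how far the mean and the median move. It is an illustration only, not code from the book: the outlier value of 1000 is made up, and it uses `np.random.default_rng` rather than `np.random.seed`, so the sample differs from the one in the cell above.

```python
import numpy as np

rng = np.random.default_rng(101)
normal = rng.normal(loc=0.0, scale=1.0, size=1000)

# Hypothetical outlier: a single extreme value appended to the sample
with_outlier = np.append(normal, 1000.0)

mean_shift = abs(np.mean(with_outlier) - np.mean(normal))
median_shift = abs(np.median(with_outlier) - np.median(normal))
print('Mean shift: %0.3f Median shift: %0.3f' % (mean_shift, median_shift))
```

The mean moves by roughly one unit, while the median barely changes, which is why the median is often described as the more robust of the two.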

## List installed packages  <a id = "6"></a>

Type the command directly at the command prompt. When running it in a notebook cell, prefix it with `!`.

Check [note.nkmk.me](https://note.nkmk.me/en/python-pip-list-freeze/) for more information.
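
The installed packages can also be read from within Python, without calling `pip`. The sketch below assumes Python 3.8 or later and uses the standard library module `importlib.metadata`; it lists what is visible to the running interpreter, which is usually, but not necessarily, the same set that `pip list` reports.

```python
from importlib.metadata import distributions

# Collect (name, version) pairs for every distribution visible to this interpreter
installed = []
for dist in distributions():
    name = dist.metadata.get("Name")
    if name:  # skip entries with missing metadata
        installed.append((name, dist.version))

# Sort case-insensitively by package name
installed.sort(key=lambda pair: pair[0].lower())

# Print the first few entries in a layout similar to pip list
for name, version in installed[:5]:
    print(f"{name:<26} {version}")
```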

In [17]:
! pip list

Package                    Version
-------------------------- ------------
aiohttp                    3.9.5
aiohttp-session            2.12.0
aiosignal                  1.3.1
anyio                      4.1.0
appdirs                    1.4.4
archspec                   0.2.1
argon2-cffi                23.1.0
argon2-cffi-bindings       21.2.0
arrow                      1.3.0
asttokens                  2.4.1
async-lru                  2.0.4
attrs                      23.1.0
Automat                    22.10.0
Babel                      2.13.1
bcrypt                     4.1.2
beautifulsoup4             4.12.2
bleach                     6.1.0
boltons                    23.0.0
Brotli                     1.0.9
bs4                        0.0.2
cached-property            1.5.2
certifi                    2023.11.17
cffi                       1.15.1
charset-normalizer         2.0.4
cloudpickle                3.0.0
colorama                   0.4.6
comm                       0.1.4
conda              