<div style="color:#006666; padding:0px 10px; border-radius:5px; font-size:18px;"><h1 style='margin:10px 5px'>Practical Tips and Tricks</h1>
</div>

© Copyright Machine Learning Plus

<div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>1. Read and Write Pandas Objects Directly to Compressed File Format 
</h2>
</div>


In [None]:
%%time
import pandas as pd
df = pd.read_csv("Datasets/large_dataset.csv")

In [None]:
df.info()

__Store as compressed gzip__

In [None]:
%%time
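# Note: the file is gzip-compressed despite its .zip suffix,
# so compression='gzip' must also be passed explicitly when reading it back.
df.to_csv('Datasets/large_dataset.csv.zip', compression='gzip', index=False)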

In [None]:
from pathlib import Path
Path('Datasets/large_dataset.csv.zip').stat()

__Check file size__

In [None]:
Path('Datasets/large_dataset.csv.zip').stat().st_size / 1024**2

In [None]:
%%time
df2 = pd.read_csv('Datasets/large_dataset.csv.zip', compression='gzip')
df2.head()
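
As a minimal sketch of the same round trip, the example below uses a small synthetic frame (hypothetical data, not the dataset above) and a `.csv.gz` suffix. With that suffix pandas can infer gzip compression on both write and read, so the `compression` argument can be left at its default.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Small synthetic frame standing in for the large dataset (hypothetical data).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "id": np.arange(1_000),
    "flag": rng.integers(0, 2, size=1_000),
    "label": rng.choice(["a", "b", "c"], size=1_000),
})

with tempfile.TemporaryDirectory() as tmp:
    plain_path = Path(tmp) / "demo.csv"
    gzip_path = Path(tmp) / "demo.csv.gz"

    demo.to_csv(plain_path, index=False)
    # compression defaults to 'infer', so the .gz suffix selects gzip.
    demo.to_csv(gzip_path, index=False)

    plain_size = plain_path.stat().st_size
    gzip_size = gzip_path.stat().st_size

    # Reading also infers gzip from the suffix.
    restored = pd.read_csv(gzip_path)

print(f"plain: {plain_size} bytes, gzip: {gzip_size} bytes")
```

Matching the suffix to the actual compression keeps the file readable by other tools that rely on the extension.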

__Store as feather format__

In [None]:
!pip install pyarrow

In [None]:
%%time
df.to_feather('Datasets/large_dataset.feather')

In [None]:
%%time
df2 = pd.read_feather('Datasets/large_dataset.feather')
df2.head()

__Check and compare file sizes__

In [None]:
Path('Datasets/large_dataset.csv').stat()

In [None]:
# Original file size
Path('Datasets/large_dataset.csv').stat().st_size / 1024**2

In [None]:
# Gzip-compressed CSV file size
Path('Datasets/large_dataset.csv.zip').stat().st_size / 1024**2

In [None]:
# Feather file size
Path('Datasets/large_dataset.feather').stat().st_size / 1024**2

<div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>2. Save memory with sparse datatype
</h2>
</div>

If you only need a few columns from the data, import only those columns using the `usecols` parameter.

In [None]:
columns = ['HasTpm', 'Census_OSInstallLanguageIdentifier',
       'LocaleEnglishNameIdentifier']

In [None]:
df = pd.read_csv('Datasets/large_dataset.csv', usecols=columns)
df.head()

In [None]:
df.info()

__Change datatype in data read step to save memory.__

In [None]:
df = pd.read_csv('Datasets/large_dataset.csv', usecols=columns, dtype={'HasTpm':'boolean'})
df.head()

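Reading `HasTpm` as `boolean` instead of the default integer type reduces the memory the dataframe consumes. Compare the memory usage reported by `df.info()` before and after.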

In [None]:
df.info()
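
The effect of `usecols` and `dtype` can be checked on a tiny in-memory CSV. This is only a sketch: the column names come from the cells above, while the rows and the `Extra` column are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical miniature of the CSV; only the first three column names come from the text.
csv_text = """HasTpm,Census_OSInstallLanguageIdentifier,LocaleEnglishNameIdentifier,Extra
1,8,171,x
0,9,64,y
1,8,49,z
"""

wanted = ["HasTpm", "Census_OSInstallLanguageIdentifier", "LocaleEnglishNameIdentifier"]

# Same columns, once with default dtypes and once with HasTpm read as nullable boolean.
as_int = pd.read_csv(io.StringIO(csv_text), usecols=wanted)
as_bool = pd.read_csv(io.StringIO(csv_text), usecols=wanted, dtype={"HasTpm": "boolean"})

# Per-column memory in bytes (index excluded) for the two versions.
print(as_int.memory_usage(index=False))
print(as_bool.memory_usage(index=False))
```

The `Extra` column is never loaded, and `HasTpm` takes fewer bytes per row as `boolean` than as a 64-bit integer.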

__Sparse DataType__

In [None]:
columns = ['HasTpm', 'Census_OSInstallLanguageIdentifier',
       'LocaleEnglishNameIdentifier', 'UacLuaenable']

In [None]:
df = pd.read_csv('Datasets/large_dataset.csv', usecols=columns)
df.head()

In [None]:
df.info()

In [None]:
df['UacLuaenable'] = df['UacLuaenable'].astype('Sparse[int]')

In [None]:
df.info()

For sparse string data, set datatype as `Sparse[str]`.
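
Sparse storage only saves memory when most values equal the fill value, which defaults to 0 for integers. The sketch below uses a made-up column that is almost entirely 1 and sets `fill_value=1` through `pd.SparseDtype`. Whether the same choice suits `UacLuaenable` depends on its actual value distribution, which is an assumption here.

```python
import numpy as np
import pandas as pd

# Hypothetical column: 100,000 values, almost all equal to 1.
values = np.ones(100_000, dtype="int64")
values[::1000] = 0
dense = pd.Series(values)

# fill_value names the value that is left unstored.
sparse = dense.astype(pd.SparseDtype("int64", fill_value=1))

dense_bytes = dense.memory_usage(index=False)
sparse_bytes = sparse.memory_usage(index=False)

# density is the fraction of values that are actually stored.
print(dense_bytes, sparse_bytes, sparse.sparse.density)
```

Checking `value_counts()` on the column first is a quick way to pick a sensible fill value.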

<div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>3. Combine multiple small categories into one single category named 'Other'
</h2>
</div>

__Problem__

Your categorical variable has a lot of categories, and you want to put all the small categories into one group called 'Other', so that there is a total of 10 categories (the top 9 plus 'Other').

In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv("Datasets/Book_Orders/Book_Orders.csv", encoding='latin')
df.head()

In [None]:
pd.set_option("display.max_rows", 50)
vc = df['City (Billing)'].value_counts()
vc

__Make everything lower case__

In [None]:
df['City (Billing)'] = df['City (Billing)'].str.lower()

In [None]:
df['City (Billing)'] 

__Get the top 9 categories__

In [None]:
top9 = df['City (Billing)'].value_counts().nlargest(9).index
top9

__If value is not in top9, then make it 'other'__

In [None]:
df['City (Billing)'].where(df['City (Billing)'].isin(top9), 'other')

In [None]:
df['City_Cat'] = df['City (Billing)'].where(df['City (Billing)'].isin(top9), 'other')

In [None]:
df.head()

__Check the frequencies again__

In [None]:
df['City_Cat'].value_counts()
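
The steps above can be wrapped in a small reusable function. This is one possible sketch: `lump_rare` and the city data are hypothetical names and values introduced for illustration.

```python
import pandas as pd


def lump_rare(series: pd.Series, n_top: int, other: str = "other") -> pd.Series:
    """Keep the n_top most frequent values and replace the rest with `other`."""
    top = series.value_counts().nlargest(n_top).index
    return series.where(series.isin(top), other)


# Hypothetical city data with distinct frequencies.
cities = pd.Series(["pune"] * 5 + ["delhi"] * 4 + ["agra"] * 2 + ["kochi"] + ["salem"])
lumped = lump_rare(cities, n_top=2)
print(lumped.value_counts())
```

Note that missing values are not in the top list, so `where` also maps them to `other`. Fill or drop them first if that is not what you want.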

<div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>4. Split a string column into two columns</h2>
</div>

__Problem__

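Split the `Name` column into two columns named `lastname` and `firstname`. Names in the Titanic data are stored as "Surname, Title Given-names", so the part before the comma is the last name.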

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv("Datasets/Titanic.csv", encoding='latin')
df.head()

In [None]:
df['Name'].str.split(', ', expand=True)

In [None]:
# The surname comes before the comma; n=1 splits on the first ', ' only
df[['lastname', 'firstname']] = df['Name'].str.split(', ', n=1, expand=True)
df.head()
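
A minimal self-contained sketch of the split, using two names in the same "Surname, Title Given-names" layout. The `n=1` argument is an addition here: it limits the split to the first separator, so the result has exactly two columns even if a name contains another comma.

```python
import pandas as pd

names = pd.Series(["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"], name="Name")

# n=1 splits on the first ', ' only, so the result always has two columns.
parts = names.str.split(", ", n=1, expand=True)
parts.columns = ["lastname", "firstname"]
print(parts)
```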

<div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>5. How to insert a new column into a DataFrame at a specific location
</h2>
</div>


__Problem__

Whenever you create a new column in the dataframe, it is created at the end of the dataframe. But you may want to place it at a different position.

To create a new column at a specific position, use `df.insert`.

__Ex:__  Create a new column: `Mean_Class_Fare` next to the `Fare` column.

In [None]:
import numpy as np
import pandas as pd
df = pd.read_csv("Datasets/Titanic.csv")
df.head()

In [None]:
# 1. First create the new column as a series
# Passing the string 'mean' avoids the deprecation warning newer pandas raises for np.mean
mean_class_fare = df.groupby('Pclass')['Fare'].transform('mean')
mean_class_fare

In [None]:
df.insert(10, "Mean_Class_Fare", mean_class_fare)

In [None]:
df.head()

__If you don't want to count the insert position, but know the name of the column after which to insert, look up that column's position by name.__

In [None]:
# delete the column first
del df['Mean_Class_Fare']

In [None]:
df.head()

In [None]:
int(np.argwhere(df.columns=='Fare')[0][0])

In [None]:
def insert_next_to(df, next_to, name, value):
    loc = int(np.argwhere(df.columns==next_to)[0][0] + 1)
    df.insert(loc, name, value)    

In [None]:
insert_next_to(df, 'Fare', 'Mean_Class_Fare', mean_class_fare)

In [None]:
df.head()
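
An alternative sketch of the same helper uses `Index.get_loc`, which returns the integer position of a column label directly. The function name `insert_after` and the small frame are hypothetical, and the helper assumes the column label is unique.

```python
import pandas as pd


def insert_after(df: pd.DataFrame, after: str, name: str, value) -> None:
    """Insert column `name` immediately to the right of column `after` (in place)."""
    position = df.columns.get_loc(after) + 1  # assumes `after` is a unique column label
    df.insert(position, name, value)


# Hypothetical miniature of the Titanic columns used above.
fares = pd.DataFrame({
    "Pclass": [1, 1, 3],
    "Fare": [80.0, 60.0, 8.0],
    "Cabin": ["C85", "C123", None],
})
class_mean = fares.groupby("Pclass")["Fare"].transform("mean")
insert_after(fares, "Fare", "Mean_Class_Fare", class_mean)
print(fares.columns.tolist())
```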

<div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>6. How to select elements using both label and position
</h2>
</div>

__Problem__

`.loc` selects using row index labels and column labels.

`.iloc` selects using the integer positions of rows and columns.

How can you select when you have a mix: (row label, column position) or (row position, column label)?

__Task__

Get the name "Braund, Mr. Owen Harris" using a mix of label and position using `df.loc` or `df.iloc`.

In [None]:
import pandas as pd
df = pd.read_csv("Datasets/Titanic.csv")
df.set_index('PassengerId', inplace=True)
df.head()

### Using `.loc`

When you use `.loc`, both row and column should be labels. So, you need to convert the position numbers to index or column labels.

__Index=1 and Location=2__  (When you cannot hardcode the column name)

In [None]:
df.loc[1, df.columns[2]]

__Location=0 and Label='Name'__  (when you cannot hardcode the index row name)

In [None]:
df.loc[df.index[0], 'Name']

### Using `.iloc`

When you use `.iloc`, both row and column should be integer positions. So, you need to convert the index or column labels to position numbers.

__Index=1 and Location=2__

In [None]:
df.iloc[df.index.get_loc(1), 2]

__Location=0 and Label='Name'__

In [None]:
df.iloc[0, df.columns.get_loc('Name')]
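
The same conversions extend from single cells to slices and lists. The sketch below, on a small made-up frame, selects the first two rows by position and two columns by label, once through `.loc` and once through `.iloc`. `Index.get_indexer` converts a list of labels to positions.

```python
import pandas as pd

# Hypothetical three-row frame indexed by PassengerId.
people = pd.DataFrame(
    {
        "Survived": [0, 1, 1],
        "Pclass": [3, 1, 3],
        "Name": ["Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley", "Heikkinen, Miss. Laina"],
    },
    index=pd.Index([1, 2, 3], name="PassengerId"),
)

# .loc: convert the row positions to index labels.
by_loc = people.loc[people.index[:2], ["Name", "Pclass"]]

# .iloc: convert the column labels to integer positions.
col_positions = people.columns.get_indexer(["Name", "Pclass"])
by_iloc = people.iloc[:2, col_positions]

print(by_loc)
```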

### Challenge

From the Titanic dataset, get the `Ticket` value of "Allen, Mr. William Henry" using `df.loc`, without hard-coding the column name "Ticket".

```python
import pandas as pd
df = pd.read_csv("Datasets/Titanic.csv")
df.set_index('PassengerId', inplace=True)
df.head()
```

In [None]:
import pandas as pd
df = pd.read_csv("Datasets/Titanic.csv")
df.set_index('PassengerId', inplace=True)
df.head()

<div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>7. Remove a column from a DataFrame and store it as a separate Series</h2>
</div>

__Problem__

Remove the `Name` column from the dataframe and store it in a separate series.

In [None]:
import pandas as pd
df = pd.read_csv("Datasets/Titanic.csv", encoding='latin')
df.head()

In [None]:
passenger_names = df.pop('Name')

In [None]:
# Name column removed
df.head()

In [None]:
passenger_names

<div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>8. Create a bunch of new columns based on existing columns</h2>
</div>

__Problem__

You want to convert all text columns to upper case and store each result as a new column whose name is the original column name suffixed with `_CAPS`.

In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv("Datasets/Titanic.csv")
# The surname comes before the comma; n=1 splits on the first ', ' only
df[['lastname', 'firstname']] = df['Name'].str.split(', ', n=1, expand=True)
df.head()

__Solution__

In [None]:
for col in ['Name', 'Sex', 'firstname', 'lastname']:
    df[f'{col}_CAPS'] = df[col].str.upper()

In [None]:
df.head()
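
Instead of listing the text columns by hand, they can be detected from their dtype. This is a sketch on a small hypothetical frame. It relies on `pandas.api.types.is_string_dtype`, and assumes the text columns contain only strings.

```python
import pandas as pd
from pandas.api.types import is_string_dtype

# Hypothetical frame with two text columns and one numeric column.
frame = pd.DataFrame({
    "Name": ["Braund", "Allen"],
    "Sex": ["male", "male"],
    "Age": [22.0, 35.0],
})

# Collect the text columns first, so the loop does not see the columns it adds.
text_cols = [col for col in frame.columns if is_string_dtype(frame[col])]

for col in text_cols:
    frame[f"{col}_CAPS"] = frame[col].str.upper()

print(frame.columns.tolist())
```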

<div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>9. Coloring negative values in a dataframe</h2>
</div>

Ref: https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html

In [None]:
import pandas as pd
import numpy as np

In [None]:
# Create dataframe
np.random.seed(100)
df = pd.DataFrame(np.random.randint(-10, 10, (7,4)), 
                  columns=list('ABCD'))
df

__Color the negative values red__

In [None]:
def color_negative_red(val):
    color = 'red' if val < 0 else 'black'
    return 'color: %s' % color

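# Applies the function to every cell. In pandas 2.1+, Styler.applymap is deprecated in favor of Styler.map.
df.style.applymap(color_negative_red)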

__Highlight maximum value in each column__

In [None]:
# axis=0 compares values down each column
df.style.highlight_max(axis=0)

__Highlight min in every row__

In [None]:
df.style.highlight_min(color='lightgreen', axis=1)

__Highlight Null values__

In [None]:
# insert missing values (the nullable Int64 dtype lets integer columns hold pd.NA)
df = df.astype('Int64')
df.iloc[1, [1, 3]] = pd.NA

df.style.highlight_null()
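
The built-in highlighters cover common cases. For custom rules that depend on a whole column, a function can return one CSS string per cell. The sketch below defines such a function (`highlight_above_mean` is a hypothetical name) and calls it on a plain Series so the returned strings are visible.

```python
import pandas as pd


def highlight_above_mean(column: pd.Series) -> list:
    """Return one CSS string per cell: yellow background where value > column mean."""
    mean = column.mean()
    return ["background-color: yellow" if value > mean else "" for value in column]


scores = pd.Series([1, 2, 9], name="A")
css = highlight_above_mean(scores)
print(css)
```

To style a dataframe with it, pass the function to `df.style.apply(highlight_above_mean, axis=0)`, which calls it once per column.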

<div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>10. Get a complete profile report of your data with a single function
</h2>
</div>

The `pandas-profiling` package provides a detailed summary of a dataset's statistics in a nice HTML report.

It's available through a single call of the `profile_report()` method.

In [None]:
# !pip install pandas-profiling pandas-summary

In [None]:
import numpy as np
import pandas_profiling
import pandas as pd
from pandas_summary import DataFrameSummary

In [None]:
df = pd.read_csv("Datasets/Titanic.csv")
df.describe()

In [None]:
dfs = DataFrameSummary(df)
dfs['Age']

In [None]:
df.profile_report()

<div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>11. Make interactive plots
</h2>
</div>

In [None]:
!pip install hvplot

In [None]:
import pandas as pd
import numpy as np
pd.options.plotting.backend = 'hvplot'

In [None]:
df = pd.read_csv('Datasets/Titanic.csv')
df

In [None]:
df.plot(kind='scatter', x='Age', y='Fare', c='Survived')

<div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>12. Third Party Data API Integrations
</h2>
</div>

Ref: https://pandas-datareader.readthedocs.io/en/latest/remote_data.html

In [None]:
# # Install
# !pip install git+https://github.com/pydata/pandas-datareader.git

`pandas-datareader` provides an API to get data directly from the following sources:

1. Google Finance: Stock markets data
2. Tiingo: Stock markets
3. Morningstar
4. IEX
5. Robinhood
6. Enigma
7. Quandl
8. FRED: Federal Reserve Economic Data
9. Fama/French
10. World Bank
11. OECD
12. Eurostat
13. TSP Fund Data
14. Nasdaq Trader Symbol Definitions
15. Stooq Index Data
16. MOEX Data

In [None]:
import pandas_datareader as pdr

__Quandl__


You need an API key from quandl.com. Sign up and you will be given one.

In [None]:
import pandas_datareader as pdr
import pandas_datareader.data as web

# Thyrocare
symbol = 'BSE/BOM539871'
# Replace the placeholder with your own Quandl API key
df = web.DataReader(symbol, 'quandl', '2020-01-01', '2020-12-31', api_key="YOUR_QUANDL_API_KEY")
df

__Potential GDP__

In [None]:
symbol = "FRED/NROUST"
# Replace the placeholder with your own Quandl API key
df = web.DataReader(symbol, 'quandl', '1950-01-01', '2031-12-31', api_key="YOUR_QUANDL_API_KEY")
df.head()

In [None]:
df.plot(title='Potential GDP', figsize=(12,7))

<div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>13. Dataframe Interactive Visualization
</h2>
</div>

D-Tale is a lightweight web client for visualizing pandas data structures. It provides a rich spreadsheet-style grid which acts as a wrapper for a lot of pandas functionality (query, sort, describe, corr…) so users can quickly manipulate their data.

In [None]:
import pandas as pd
import dtale

In [None]:
# !pip install dtale

In [None]:
df = pd.read_csv('Datasets/Titanic.csv')
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


With D-Tale you can:

- Run a missing-data analysis
- Describe the summary stats
- Analyze duplicates
- View a variance report
- Highlight missing values
- Highlight outliers
- Highlight a range of values
- Mark variables with low variance
- and more

In [None]:
dtale.show(df)



