
- By default, pandas truncates wide DataFrames and displays only the first 10 and last 10 features.

- Our dataset has 81 features, and we need to see all of them to perform EDA. Raising the display limits with `pd.set_option` makes that possible.

In [None]:
import pandas as pd

# Raise the display limits so that wide or long DataFrames are shown in full
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

-----------------

In [None]:
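import matplotlib.pyplot as plt
import seaborn as sns

# Apply the style first so it cannot override the explicit rcParams set below
plt.style.use('fivethirtyeight')
blue = sns.color_palette('viridis')[1]
green = sns.color_palette('viridis')[4]
plt.rcParams['figure.figsize'] = (15,4)
plt.rcParams['figure.dpi'] = 200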

In [None]:
# Suppress warnings
import warnings

# Ignore every warning ...
warnings.filterwarnings('ignore')

# ... or ignore only specific categories
warnings.filterwarnings(action = 'ignore', category = DeprecationWarning)
warnings.filterwarnings(action = 'ignore', category = FutureWarning)

# Shortcut that also ignores every warning
warnings.simplefilter(action='ignore')

----------------------

## Seaborn

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.color_palette('viridis')

In [None]:
sns.set_style('whitegrid')

In [None]:
plt.figure(figsize=(10,10))
p=sns.light_palette(color='blue',n_colors=16,reverse=True)
sns.countplot(y='release_year',data=show,palette=p,order=show['release_year'].value_counts().index[:15])
plt.ylabel("Year of Release")

In [None]:
plt.figure(figsize=(10,10))
p=sns.light_palette(color='red',n_colors=16,reverse=True)
sns.countplot(y='Added_Year',data=show,palette=p,order=show['Added_Year'].value_counts().index[:15])
plt.ylabel('Year of Addition on Netflix')

In [None]:
plt.figure(figsize=(12,10))
sns.countplot(x = "rating", data=dt_movies, palette="Set1", order=dt_movies['rating'].value_counts().index[:15])

--------------------

## Matplotlib

In [None]:
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import matplotlib.gridspec as gridspec 

In [None]:
plt.figure(figsize=(8,6))
sns.countplot(x='type', data=US)

# Get current axis on current figure
ax = plt.gca()

# Set the y-axis upper limit from the tallest bar, leaving headroom for the labels
y_max = US['type'].value_counts().max()
ax.set_ylim([0, y_max * 1.1])

# Iterate through the list of axes' patches
for p in ax.patches:
    ax.text(p.get_x() + p.get_width()/2., p.get_height(), '%d' % int(p.get_height()), 
            fontsize=12, color='black', ha='center', va='bottom')


plt.title('Comparison of Total TV Shows & Movies', size=15)
plt.show()

--------

✔️ When to use Accuracy? (Important!)

Accuracy is a good measure when the classes of the target variable are nearly balanced, for example Survived (60% yes, 40% no).
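
As a rough illustration of why class balance matters, the sketch below scores a baseline that always predicts the majority class. The labels are made up (they do not come from any dataset in these notes) and `accuracy` is a hypothetical helper. On the nearly balanced 60/40 split the baseline reaches only 60%, while on an imbalanced 95/5 split it reaches 95% without learning anything.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Made-up labels for illustration only
balanced = [1] * 60 + [0] * 40      # 60% yes, 40% no
imbalanced = [0] * 95 + [1] * 5     # 95% no, 5% yes

# A "model" that always predicts the majority class
balanced_pred = [1] * len(balanced)
imbalanced_pred = [0] * len(imbalanced)

balanced_acc = accuracy(balanced, balanced_pred)
imbalanced_acc = accuracy(imbalanced, imbalanced_pred)
print(balanced_acc, imbalanced_acc)
```

A high accuracy on the imbalanced data therefore says little about whether the minority class is ever predicted correctly.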

---------------

- Adding a feature to the feature space never decreases **R-squared**, whether the feature is **important or unimportant**; in practice it almost always **increases.**

-  If you add more and more **useless variables** to a model, **adjusted R-squared** will **decrease.**

- If you add more **useful variables**, **adjusted R-squared** will **increase.**

- The adjusted R-squared can be negative, but it usually is not. **Adjusted R-squared is always less than or equal to R-squared.**
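
The points above follow from the standard formula, adjusted R-squared = 1 - (1 - R-squared)(n - 1)/(n - p - 1), where n is the number of samples and p the number of features. The sketch below plugs in illustrative numbers (not taken from a fitted model); `adjusted_r2` is a hypothetical helper name.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Illustrative numbers only (not from a fitted model)
base = adjusted_r2(0.800, n_samples=50, n_features=5)

# A useless extra feature nudges R^2 up slightly, but the penalty outweighs it
with_useless = adjusted_r2(0.801, n_samples=50, n_features=6)

# A useful extra feature raises R^2 enough to beat the penalty
with_useful = adjusted_r2(0.850, n_samples=50, n_features=6)

print(round(base, 4), round(with_useless, 4), round(with_useful, 4))
```

In every case the adjusted value stays below the plain R-squared it was computed from.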

---------

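----> Hyperparameters:

 1. Naive Bayes has essentially no hyperparameters to tune apart from the smoothing parameter (for example `alpha` in scikit-learn's `MultinomialNB`)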


----------------

- **KNN** can be used in both **Regression & Classification**
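
A minimal sketch of the difference, using made-up one-dimensional data and hypothetical helper names: the neighbour search is shared, and only the final step changes (majority vote for classification, mean of the neighbours' targets for regression).

```python
from collections import Counter

def k_nearest(train_x, train_y, query, k):
    """Return the targets of the k training points closest to query (1-D distance)."""
    ranked = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - query))
    return [y for _, y in ranked[:k]]

def knn_classify(train_x, labels, query, k=3):
    # Classification: majority vote among the k nearest labels
    return Counter(k_nearest(train_x, labels, query, k)).most_common(1)[0][0]

def knn_regress(train_x, values, query, k=3):
    # Regression: mean of the k nearest target values
    neighbours = k_nearest(train_x, values, query, k)
    return sum(neighbours) / len(neighbours)

# Made-up 1-D data for illustration
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["low", "low", "low", "high", "high", "high"]
values = [1.5, 2.5, 3.5, 10.5, 11.5, 12.5]

predicted_label = knn_classify(x, labels, query=2.2)
predicted_value = knn_regress(x, values, query=2.2)
print(predicted_label, predicted_value)
```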

---------

In [None]:
# Break date_added (format 'Month D, YYYY') down into year, month and day;
# date_added is dropped later along with show_id
date_added = show['date_added'].str.strip()  # some values carry leading spaces
show['Added_Year'] = date_added.str.split(', ').str[-1]
show['Added_Month'] = date_added.str.split(' ').str[0]
show['Added_Day'] = date_added.str.split(' ').str[1].str.rstrip(',')
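
As a possible alternative, assuming every `date_added` value follows the `Month D, YYYY` pattern implied by the splits above, `pd.to_datetime` can parse the column once and the `.dt` accessor can supply the parts as numbers. The sample frame below is made up for illustration.

```python
import pandas as pd

# Made-up sample; the leading space mimics untrimmed values
show = pd.DataFrame({"date_added": ["September 9, 2019", " January 1, 2020"]})

# Parse once, then read year, month name and day from the datetime column
parsed = pd.to_datetime(show["date_added"].str.strip(), format="%B %d, %Y")
show["Added_Year"] = parsed.dt.year
show["Added_Month"] = parsed.dt.month_name()
show["Added_Day"] = parsed.dt.day
print(show)
```

With this approach the year and day columns are already integers, so no later `astype(int)` cast is needed.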

In [None]:
# Example value: "93 min" (see the Netflix dataset on Kaggle)
# Remove the ' min' suffix from each value of the duration column and convert to int
movies['duration']=movies['duration'].str.replace(' min','').astype(int)

In [None]:
import re
# Alternative: strip every non-digit character, then convert to a numeric dtype
dt['duration'] = dt['duration'].map(lambda x : re.sub(r'[^0-9]','',x))
dt['duration'] = pd.to_numeric(dt['duration'])

In [None]:
# Count shows per country and display the 11 most frequent countries
dt_shows_country = pd.DataFrame({'country':dt_shows['country'].value_counts()})
dt_shows_country[:11]

------------

## Missing Values

In [None]:
# Percentage of missing values in each column
show.isnull().sum()/len(show)*100.0

In [None]:
# Fill the column with many missing values ('director') with 'Unknown',
# then drop the few remaining rows that still contain missing values
show['director']=show['director'].fillna('Unknown')
show=show.dropna()

---------------

## Data Type

In [None]:
# show.info() reports the column as: 10  Added_Year    6643 non-null   object
# Cast it from object (string) to int
show['Added_Year']=show['Added_Year'].astype(int)

-----------

## Normalisation

### Why do we do column normalization?

- Real-world data is collected in different units. For example, height may be recorded in centimeters or meters, and weight in kilograms or pounds. Column normalization removes the ambiguity about what a particular value represents:

  - the data is converted to a scale-independent form.

  - the data matrix becomes scale-independent.
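
A minimal sketch of the idea, assuming min-max scaling is the normalization meant here (the notes do not name a specific method). The helper `min_max_normalize` and the height values are made up for illustration; the same heights recorded in centimeters and in meters map to the same normalized column.

```python
import pandas as pd

def min_max_normalize(column):
    """Rescale a numeric column to the [0, 1] range."""
    return (column - column.min()) / (column.max() - column.min())

# Made-up heights, recorded in two different units
df = pd.DataFrame({"height_cm": [150.0, 160.0, 170.0, 180.0, 190.0]})
df["height_m"] = df["height_cm"] / 100

norm_cm = min_max_normalize(df["height_cm"])
norm_m = min_max_normalize(df["height_m"])
print(pd.DataFrame({"norm_cm": norm_cm, "norm_m": norm_m}))
```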

-----------------

## Combine the small categories into a single category named "Other"

In [2]:
import pandas as pd
d = {"class": ["A", "A", "A", "A", "A", "B", "B", "C", "D", "E", "F"]}
df = pd.DataFrame(d)
df

   class
0      A
1      A
2      A
3      A
4      A
5      B
6      B
7      C
8      D
9      E
10     F


In [5]:
df["class"].value_counts()

A    5
B    2
C    1
F    1
E    1
D    1
Name: class, dtype: int64

In [3]:
# Step 1: count the frequencies
frequencies = df["class"].value_counts(normalize = True)
print(frequencies)

A    0.454545
B    0.181818
C    0.090909
F    0.090909
E    0.090909
D    0.090909
Name: class, dtype: float64


In [6]:
# Step 2: establish your threshold and filter the smaller categories
threshold = 0.1
small_categories = frequencies[frequencies < threshold].index
print(small_categories)

Index(['C', 'F', 'E', 'D'], dtype='object')


In [7]:
# Step 3: replace the values
df["class"] = df["class"].replace(small_categories, "Other")
df["class"].value_counts(normalize = True)

A        0.454545
Other    0.363636
B        0.181818
Name: class, dtype: float64

---------------