# Data cleaning and Pre-processing

# Pandas
### Pandas: derived from 'panel data'
### Based on NumPy.
### Usage: cleaning, transforming and analyzing data.
### Used with:
- Matplotlib: visualize data.
- SciPy: statistical analysis.
- Scikit-learn: ML algorithms.

In [None]:
import pandas as pd
import numpy as np 

# Dataset: alldata.csv
## Source: Kaggle

## Dataframe

In [None]:
df = pd.read_csv("alldata.csv")
#df

## Viewing the data

In [None]:
df.head() # Shows the top 5 rows
#df.tail() # Shows last 5 rows

### Getting info about the dataframe

In [None]:
df.info()

In [None]:
df.shape

#### Handling duplicates: How to get rid of duplicate rows

In this case, there are no duplicate rows, so we create some by stacking the dataframe on top of itself and then remove them again.

In [None]:
temp = pd.concat([df, df]) # Stack df on itself to create duplicates (DataFrame.append was removed in pandas 2.0)
temp.shape

#https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/

In [None]:
#duplicateRowsDF = temp[temp.duplicated()] #.duplicated() finds duplicate rows
#print("Duplicate Rows except first occurrence based on all columns are :")
#print(duplicateRowsDF)

temp.drop_duplicates().shape


#https://thispointer.com/pandas-find-duplicate-rows-in-a-dataframe-based-on-all-or-selected-columns-using-dataframe-duplicated-in-python/
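
`duplicated()` and `drop_duplicates()` also accept `subset` and `keep` arguments. The following is a minimal sketch on a small made-up dataframe; the `toy` frame and its values are hypothetical and are not taken from alldata.csv.

```python
import pandas as pd

# Hypothetical toy data, not taken from alldata.csv.
toy = pd.DataFrame({
    "position": ["Data Analyst", "Data Analyst", "Data Scientist", "Data Analyst"],
    "location": ["Boston, MA", "Boston, MA", "Boston, MA", "Austin, TX"],
})

# Boolean mask: True for each repeat of an earlier row (the first occurrence stays False).
dup_mask = toy.duplicated()
n_duplicates = int(dup_mask.sum())

# Drop exact duplicate rows, keeping the first occurrence.
deduped = toy.drop_duplicates()

# Treat rows as duplicates when only 'position' matches, keeping the last occurrence.
deduped_by_position = toy.drop_duplicates(subset=["position"], keep="last")

print(n_duplicates)
print(deduped.shape)
print(deduped_by_position)
```

Neither call modifies `toy`; each returns a new object, so the result has to be assigned to a name to be kept.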

### Working with missing values
- Remove rows/cols with nulls
- Impute (Null values substituted by non-null values)

In [None]:
#df.isnull() # If missing -> True. 
df.isnull().sum() # number of nulls/missing values in each col. 
#df.head()

#### Removing missing values

In [None]:
# Dropping rows with nulls
df.dropna(inplace = True) # inplace = True modifies df itself instead of returning a new dataframe.
df.isnull().sum()
# Dropping cols
#df.dropna(axis=1)
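
`dropna()` can be restricted to particular columns or applied to columns instead of rows. A small sketch of these variants, assuming a made-up dataframe (`toy` and its values are hypothetical, not from alldata.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from alldata.csv.
toy = pd.DataFrame({
    "company": ["A", "B", "C", "D"],
    "reviews": [10.0, np.nan, np.nan, 40.0],
    "location": ["X", "Y", None, "Z"],
})

# Drop every row that has at least one null.
rows_any = toy.dropna()

# Drop rows only when 'location' is null; nulls in other columns are ignored.
rows_subset = toy.dropna(subset=["location"])

# Drop every column that has at least one null.
cols_any = toy.dropna(axis=1)

print(rows_any.shape)
print(rows_subset.shape)
print(list(cols_any.columns))
```

Without `inplace = True`, `toy` itself keeps all of its rows and columns; only the returned dataframes are reduced.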

#### Imputation: Substitute nulls with non-null values
- To avoid discarding part of the dataset by dropping nulls, replace them with the mean or median of the column.
  - Pick the column(s).
  - Find the mean/median of each column.
  - Fill the nulls with that mean/median.
 

In [None]:
# Reload the original dataframe, because dropna(inplace = True) above removed rows from df.

df = pd.read_csv("alldata.csv")
df.isnull().sum() 

In [None]:
reviews = df['reviews']
reviews.head()


In [None]:
reviews_mean = reviews.mean()
#reviews_mean
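# Assign the result back to df; fillna(inplace = True) on the extracted column is not guaranteed to update df.
df['reviews'] = reviews.fillna(reviews_mean)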

In [None]:
df.isnull().sum()
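
The same steps work with the median, which is less sensitive to a few very large values than the mean. A minimal sketch, assuming a made-up `reviews` column (the `toy` frame is hypothetical, not from alldata.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one very large value and two nulls.
toy = pd.DataFrame({"reviews": [10.0, np.nan, 30.0, 1000.0, np.nan]})

# Step 1 and 2: pick the column and compute its median (nulls are skipped by default).
reviews_median = toy["reviews"].median()

# Step 3: fill the nulls and assign the result back to the dataframe.
toy["reviews"] = toy["reviews"].fillna(reviews_median)

print(reviews_median)
print(toy["reviews"].isnull().sum())
```

Whether the mean or the median is the better fill value depends on the column's distribution; this sketch only shows the mechanics.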

### Distribution of continuous variables

In [None]:
df.describe()
#df['reviews'].describe()
#df.corr(numeric_only=True)   # Correlation between the numeric columns (text columns are skipped).

### Selecting cols/rows in dataframe

In [None]:
X = df[['position', 'location']] # selecting cols
X.head()

In [None]:
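# Selecting rows: iloc = integer location; selects rows by numerical position.
df_subset = df.iloc[1]    # A single row (the second one), returned as a Series
df_subset = df.iloc[1:4]  # Rows at positions 1 to 3 (the end position is excluded)
df_subset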

In [None]:
# Selecting rows: loc selects by index label.
 
# If the index held labels (names) rather than integers, we could use df.loc['name1']
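
Since the dataframe here has a default integer index, label-based selection is easier to see on a small example. The sketch below assumes a made-up dataframe whose index labels `name1`, `name2`, `name3` are hypothetical:

```python
import pandas as pd

# Hypothetical toy data with a label-based index.
toy = pd.DataFrame(
    {
        "position": ["Data Analyst", "Data Scientist", "ML Engineer"],
        "reviews": [120.0, 4500.0, 9000.0],
    },
    index=["name1", "name2", "name3"],
)

# A single row by label, returned as a Series.
row = toy.loc["name1"]

# A label slice; unlike iloc, the end label is included.
rows = toy.loc["name1":"name2"]

# A single cell by row label and column name.
cell = toy.loc["name3", "reviews"]

print(row["position"])
print(rows.shape)
print(cell)
```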

### Conditional selections

E.g., we want to select the rows where the position is Data Analyst.

In [None]:
df[df['position'] == "Data Analyst"]
#df[(df['position'] == "Data Analyst")| (df['position'] == "Data Scientist")]


#Concise way of selecting more than one option : isin()

#df[df['position'].isin(['Data Analyst', 'Data Scientist'])]
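
Conditions are boolean Series, so they can be stored, combined with `&` (and), `|` (or) and negated with `~`. A minimal sketch on made-up data (the `toy` frame and the threshold of 8000 are illustrative assumptions):

```python
import pandas as pd

# Hypothetical toy data, not taken from alldata.csv.
toy = pd.DataFrame({
    "position": ["Data Analyst", "Data Scientist", "Data Analyst", "ML Engineer"],
    "reviews": [50.0, 9000.0, 8500.0, 200.0],
})

# Each comparison yields a boolean Series aligned with the rows.
is_analyst = toy["position"] == "Data Analyst"
many_reviews = toy["reviews"] >= 8000

# Both conditions must hold.
analysts_many = toy[is_analyst & many_reviews]

# Match any value from a list.
either = toy[toy["position"].isin(["Data Analyst", "Data Scientist"])]

# Negate a condition.
not_analyst = toy[~is_analyst]

print(analysts_many)
print(either.shape)
print(not_analyst)
```

When the comparisons are written inline, each one needs its own parentheses, because `&` and `|` bind more tightly than `==` and `>=`.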

### Applying function : apply()

In [None]:
def review_func(x):
    if x >= 8000:
        return "good"
    else:
        return "bad"

In [None]:
df["rating"] = df["reviews"].apply(review_func) # Adds a new col 'rating'
df.head(10)
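
For a simple two-way rule like this one, `numpy.where` gives the same labels without calling a Python function per row. A sketch on made-up values (the `toy` frame is hypothetical; `review_func` mirrors the function above):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from alldata.csv.
toy = pd.DataFrame({"reviews": [120.0, 8000.0, 15000.0, 7999.0]})

def review_func(x):
    # Same rule as in the notebook: 8000 or more reviews counts as "good".
    return "good" if x >= 8000 else "bad"

# Row-by-row version using apply().
via_apply = toy["reviews"].apply(review_func)

# Vectorized version: evaluate the condition on the whole column at once.
toy["rating"] = np.where(toy["reviews"] >= 8000, "good", "bad")

print(toy)
print((via_apply == toy["rating"]).all())
```

`apply()` remains the more flexible choice when the rule has several branches or cannot be expressed as a column-wide condition.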

## Plotting
- For discrete variables, use bar charts and boxplots.
- For continuous variables, use scatterplots, line graphs, histograms and boxplots.

In [None]:
import matplotlib.pyplot as plt

#Scatter plot
#plt.scatter(df.values[:,3], df.values[:,5])

#boxplot using matplotlib
plt.boxplot(df['reviews'])
plt.show()

#boxplot using dataframe 
#fig = plt.figure()
#axes = fig.add_axes([0,0,1,1]) # left, bottom, width, height
#bxplt = df.boxplot(column = 'reviews', by = 'rating', ax = axes)
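
A grouped boxplot can also be built with plain matplotlib by passing one array per group. The sketch below assumes a made-up dataframe with `reviews` and `rating` columns (hypothetical values); the `Agg` backend line is only there so the code runs without a display and can be omitted in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data, not taken from alldata.csv.
toy = pd.DataFrame({
    "reviews": [100.0, 250.0, 400.0, 8200.0, 9100.0, 12000.0],
    "rating": ["bad", "bad", "bad", "good", "good", "good"],
})

# One array of review counts per rating, in a fixed (sorted) order.
labels = sorted(toy["rating"].unique())
data = [toy.loc[toy["rating"] == label, "reviews"].to_numpy() for label in labels]

fig, ax = plt.subplots()
result = ax.boxplot(data)  # one box per array, drawn at x = 1, 2, ...
ax.set_xticks(list(range(1, len(labels) + 1)))
ax.set_xticklabels(labels)
ax.set_xlabel("rating")
ax.set_ylabel("reviews")
plt.show()
```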
