##**Handling NaN**

###**Values considered “missing”**


As data comes in many shapes and forms, pandas aims to be flexible with regard to handling missing data. While NaN is the default missing value marker for reasons of computational speed and convenience, we need to be able to easily detect this value with data of different types: floating point, integer, boolean, and general object. In many cases, however, the Python None will arise and we wish to also consider that “missing” or “not available” or “NA”.

To make detecting missing values easier (and across different array dtypes), pandas provides the **isna()** and **notna()** functions, which are also methods on Series and DataFrame objects.
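As a minimal sketch of the two functions, the snippet below builds a small, hypothetical object Series (not part of the iris data) that mixes a float NaN and a Python None, then checks it both with the top-level function and with the method form.

```python
import numpy as np
import pandas as pd

# Hypothetical toy Series: one NaN, one None, two real values
s = pd.Series([1.0, np.nan, None, "text"], dtype=object)

missing = pd.isna(s)   # top-level function: True where the value is NaN or None
present = s.notna()    # method form: the element-wise inverse

print(missing)
print(present)
```

Both NaN and None are reported as missing, which is why the same check works across dtypes.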

Because NaN is a float, a column of integers with even one missing value is cast to a floating-point dtype.
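The cast can be observed directly. This is a small sketch on hypothetical toy Series, assuming the default NumPy-backed dtypes are in use:

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3])           # all values present
with_gap = pd.Series([1, np.nan, 3])  # one value missing

print(ints.dtype)      # an integer dtype
print(with_gap.dtype)  # float64
```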

NaN values can distort our estimates and calculations. There are two ways we can handle NaN:
1. we remove them, or
2. we fill them.

Our current data does not have any NaN values, so we will create some.

In [None]:
import numpy as np
df = iris.copy()
df.columns = ['sl', 'sw', 'pl', 'pw', 'flower_type']

In [None]:
df.iloc[2:4, 1:3] = np.nan
df.head()

Unnamed: 0,sl,sw,pl,pw,flower_type
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,,,0.2,Iris-setosa
3,4.6,,,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [None]:
df.describe()

Unnamed: 0,sl,sw,pl,pw
count,150.0,148.0,148.0,150.0
mean,5.843333,3.052703,3.790541,1.198667
std,0.828066,0.436349,1.754618,0.763161
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.4,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


###**Dropping NaN**

**dropna()**: This removes the rows (the default) or the columns that contain NaN values.

In [None]:
df.dropna(inplace = True)  ## Drop the rows containing NaN, modifying df itself
df.reset_index(drop = True, inplace = True)   ## Renumber the index from 0 after the drop

In [None]:
df.head()

Unnamed: 0,sl,sw,pl,pw,flower_type
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,5.0,3.6,1.4,0.2,Iris-setosa
3,5.4,3.9,1.7,0.4,Iris-setosa
4,4.6,3.4,1.4,0.3,Iris-setosa


As you may observe, the two rows containing NaN have been removed. To drop the columns that contain NaN instead, pass the 'axis' parameter (axis=1).
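The difference between the two axes is easiest to see on a small frame. The sketch below uses a hypothetical toy DataFrame (the names `toy`, `rows_dropped` and `cols_dropped` are illustrative only) and returns new objects instead of modifying in place.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column 'b' has a single missing value in row 1
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, np.nan, 6.0], "c": [7, 8, 9]})

rows_dropped = toy.dropna()        # default axis=0: removes row 1
cols_dropped = toy.dropna(axis=1)  # axis=1: removes column 'b'

print(rows_dropped)
print(cols_dropped)
```

Dropping a column discards every value in it, so this is usually worthwhile only when most of the column is missing.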

###**Filling NaN**

**fillna()**: This fills NaN with a scalar value. You can also fill NaN using a dict or Series that is alignable. The labels of the dict or index of the Series must match the columns of the frame you wish to fill.
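A short sketch of both forms, using a hypothetical toy frame whose column names mirror the ones in this section. The fill values are arbitrary and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one gap in each column
toy = pd.DataFrame({"sw": [3.5, np.nan, 3.0], "pl": [1.4, 1.3, np.nan]})

# Dict: keys are column labels, values are the fill value for that column
filled_by_dict = toy.fillna({"sw": 0.0, "pl": toy["pl"].mean()})

# Series: toy.mean() is indexed by column label, so each column
# is filled with its own mean
filled_by_series = toy.fillna(toy.mean())

print(filled_by_dict)
print(filled_by_series)
```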

Generally we fill the NaN values with the mean, but depending on the type of data and your own analysis, you may decide to fill NaN in some other way.

In [None]:
df.iloc[2:4, 1:3] = np.nan
df.head()

Unnamed: 0,sl,sw,pl,pw,flower_type
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,5.0,,,0.2,Iris-setosa
3,5.4,,,0.4,Iris-setosa
4,4.6,3.4,1.4,0.3,Iris-setosa


In [None]:
df['sw'] = df['sw'].fillna(df['sw'].mean())  # assign back; an inplace fill on a selected column may not update df
df['pl'] = df['pl'].fillna(df['pl'].mean())
df.head()

Unnamed: 0,sl,sw,pl,pw,flower_type
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,5.0,3.043151,3.821233,0.2,Iris-setosa
3,5.4,3.043151,3.821233,0.4,Iris-setosa
4,4.6,3.4,1.4,0.3,Iris-setosa


**Note**: Since all the NaN values belong to 'Iris-setosa' rows, a better fill value would be the mean of 'sw' (and likewise 'pl') computed only over the rows where the flower type is Iris-setosa.

In [None]:
df.iloc[2:4, 1:3] = np.nan
df.head()

Unnamed: 0,sl,sw,pl,pw,flower_type
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,5.0,,,0.2,Iris-setosa
3,5.4,,,0.4,Iris-setosa
4,4.6,3.4,1.4,0.3,Iris-setosa


In [None]:
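df_setosa = df[df.flower_type == 'Iris-setosa']   # rows of the setosa species only
df['sw'] = df['sw'].fillna(df_setosa['sw'].mean())  # assign back; an inplace fill on a selected column may not update df
df['pl'] = df['pl'].fillna(df_setosa['pl'].mean())
df.head()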

Unnamed: 0,sl,sw,pl,pw,flower_type
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,5.0,3.415217,1.463043,0.2,Iris-setosa
3,5.4,3.415217,1.463043,0.4,Iris-setosa
4,4.6,3.4,1.4,0.3,Iris-setosa
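The species-specific fill works here only because every NaN sits in an Iris-setosa row. When gaps are spread over several groups, one possible generalisation is to fill each gap with the mean of its own group. The sketch below is an alternative technique, not the approach used in this notebook, and it runs on a hypothetical toy frame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in two different species
toy = pd.DataFrame({
    "sw": [3.5, np.nan, 3.0, 2.8, np.nan],
    "flower_type": ["Iris-setosa", "Iris-setosa", "Iris-setosa",
                    "Iris-virginica", "Iris-virginica"],
})

# transform("mean") returns a Series aligned with toy, holding the
# mean of each row's own group (NaN values are skipped in the mean)
group_mean = toy.groupby("flower_type")["sw"].transform("mean")
toy["sw"] = toy["sw"].fillna(group_mean)

print(toy)
```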


##**Duplicate Labels**

Index objects are not required to be unique; you can have duplicate row or column labels. 

But one of pandas’ roles is to clean messy, real-world data before it goes to some downstream system. And real-world data has duplicates, even in fields that are supposed to be unique.

Let's see how duplicate labels change the behavior of certain operations, and how to prevent duplicates from arising during operations, or detect them if they do.

###**Consequences of Duplicate Labels**

Some pandas methods (Series.reindex() for example) just don’t work with duplicates present. The output can’t be determined, and so pandas raises an error.
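A minimal sketch of the reindex failure, on a hypothetical Series with a repeated label. The exception is assumed to be a ValueError, and the exact message can differ between pandas versions.

```python
import pandas as pd

# Hypothetical Series whose index repeats the label 'b'
s = pd.Series([0, 1, 2], index=["a", "b", "b"])

try:
    s.reindex(["a", "b", "c"])
    raised = False
except ValueError as err:
    # pandas cannot decide which 'b' row to use, so it refuses
    raised = True
    print(err)
```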

Other methods, like indexing, can give very surprising results. Typically indexing with a scalar will reduce dimensionality. Slicing a DataFrame with a scalar will return a Series. Slicing a Series with a scalar will return a scalar. But with duplicates, this isn’t the case.

In [None]:
df1 = pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=["A", "A", "B"])
df1

Unnamed: 0,A,A,B
0,0,1,2
1,3,4,5


We have duplicates in the columns. If we select the unique label 'B', we get back a Series.

In [None]:
print(df1["B"])  # a series
type(df1["B"])

0    2
1    5
Name: B, dtype: int64


pandas.core.series.Series

But selecting the duplicated label 'A' returns a DataFrame.

In [None]:
print(df1["A"]) # a DataFrame
type(df1["A"])  

   A  A
0  0  1
1  3  4


pandas.core.frame.DataFrame

This applies to row labels as well.

In [None]:
df2 = pd.DataFrame({"A": [0, 1, 2]}, index=["a", "a", "b"])
df2

Unnamed: 0,A
a,0
a,1
b,2


In [None]:
df2.loc["b", "A"]  # a scalar

2

In [None]:
df2.loc["a", "A"]  # a Series

a    0
a    1
Name: A, dtype: int64

###**Duplicate Label Detection**

You can check whether an Index (storing the row or column labels) is unique with **Index.is_unique**:

In [None]:
df2

Unnamed: 0,A
a,0
a,1
b,2


In [None]:
df2.index.is_unique

False

In [None]:
df2.columns.is_unique

True

**Index.duplicated()** will return a boolean ndarray indicating whether a label is repeated.

In [None]:
df2.index.duplicated()

array([False,  True, False])
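The boolean array can be used as a filter to drop the repeated rows. The sketch below rebuilds the same `df2` so that it runs on its own, then shows two options: keeping the first occurrence of each label, or combining the duplicates (here by averaging, which is one arbitrary choice among several).

```python
import pandas as pd

df2 = pd.DataFrame({"A": [0, 1, 2]}, index=["a", "a", "b"])

# Keep only the first row seen for each label
first_only = df2.loc[~df2.index.duplicated(), :]

# Alternatively, collapse rows that share a label into their mean
averaged = df2.groupby(level=0).mean()

print(first_only)
print(averaged)
```

Which option is appropriate depends on whether the repeated rows are true duplicates or separate observations that happen to share a label.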