---
# Handling Missing Values in the Stack Overflow Data Set

---

In [1]:
import pandas as pd
import numpy as np
from IPython.display import display


In [2]:
# Print a horizontal rule, optionally with a header (for display purposes)
from typing import Optional


def printhr(s: Optional[str] = None, n: int = 40):
    """Print a horizontal rule made of "=" characters.

    Without a header, the rule is n characters long. With a header, the
    header is printed between two rules of n // 2 characters each.

    Args:
        s (str, optional): Header message. Defaults to None.
        n (int, optional): Number of "=" characters. Defaults to 40.
    """

    if s:
        print("=" * (n // 2), s, "=" * (n // 2))
    else:
        print("=" * n)

---
`read_csv()` accepts an **na_values** argument: a list of additional custom values that it treats as null while parsing. This has the same effect as loading the file first and then using `.replace()` to turn custom missing-value markers into proper null values that pandas recognizes.
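
As a minimal sketch of this behavior, the snippet below parses a small in-memory CSV instead of the survey file. The three rows and the name `csv_text` are made up for illustration; only the `na_values` argument mirrors the cell that follows.

```python
import io

import pandas as pd

# Hypothetical three-row CSV standing in for the survey file.
csv_text = "ResponseId,YearsCode\n1,Missing\n2,14\n3,NA\n"

sample = pd.read_csv(
    io.StringIO(csv_text), index_col="ResponseId", na_values=["Missing"]
)

# The custom marker "Missing" and the built-in marker "NA" both become NaN.
missing_count = int(sample["YearsCode"].isna().sum())
print(missing_count)
```

The custom values are added to the default markers pandas already recognizes (such as `NA`) rather than replacing them.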

---

In [3]:
# Stack Overflow developer survey
null_values = ["Missing"]
df = pd.read_csv(
    "data/survey_results_public_2022.csv", index_col="ResponseId", na_values=null_values
)

schema_df = pd.read_csv("data/survey_results_schema.csv", index_col="qname")

---
**Say we want to take the average coding experience (in years) of developers who took the survey**

---

In [4]:
# Explore the YearsCode column
display(df["YearsCode"].head(10))
display(df["YearsCode"].unique())

ResponseId
1     NaN
2     NaN
3      14
4      20
5       8
6      15
7       3
8       1
9       6
10     37
Name: YearsCode, dtype: object

array([nan, '14', '20', '8', '15', '3', '1', '6', '37', '5', '12', '22',
       '11', '4', '7', '13', '36', '2', '25', '10', '40', '16', '27',
       '24', '19', '9', '17', '18', '26', 'More than 50 years', '29',
       '30', '32', 'Less than 1 year', '48', '45', '38', '39', '28', '23',
       '43', '21', '41', '35', '50', '33', '31', '34', '46', '44', '42',
       '47', '49'], dtype=object)

---
`.unique()` returns an array of the unique values of a Series, in order of appearance. This is helpful for checking which values a Series contains.

Here, `.unique()` reveals two string values ("More than 50 years" and "Less than 1 year") that cannot be converted to a float, so we replace them with numerical values before casting.
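
When a column has too many distinct values to scan by eye, the same problem strings can be found programmatically. The sketch below is one possible approach, not the method used in this notebook: it coerces the column with `pd.to_numeric(errors="coerce")` and keeps the entries that were present but failed to parse. The Series `years` is a made-up sample mimicking `YearsCode`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mimicking the YearsCode column.
years = pd.Series([np.nan, "14", "More than 50 years", "Less than 1 year", "8"])

# Coerce to numbers; anything that cannot be parsed becomes NaN.
as_numbers = pd.to_numeric(years, errors="coerce")

# Entries that were present originally but failed to parse are the problem strings.
non_numeric = years[years.notna() & as_numbers.isna()].unique().tolist()
print(non_numeric)
```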

---

In [5]:
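# Replace string values to allow casting to float.
# Assign the result back: .replace(..., inplace=True) on a selected column is
# chained assignment and does not update df under pandas Copy-on-Write.
df["YearsCode"] = df["YearsCode"].replace("More than 50 years", 51)
df["YearsCode"] = df["YearsCode"].replace("Less than 1 year", 0)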


# Cast
df["YearsCode"] = df["YearsCode"].astype(float)

# Get mean
df["YearsCode"].mean()

12.251307285752338
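
---
The mean above is computed only over respondents who answered, because `.mean()` skips null values by default (`skipna=True`). A small sketch on made-up numbers (the Series `years` is hypothetical, not survey data) shows the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical values: one missing answer and three valid ones.
years = pd.Series([np.nan, 14.0, 20.0, 8.0])

# Default behavior: NaN is ignored, so the mean is taken over 3 values.
mean_skipping_nan = years.mean()

# With skipna=False, a single NaN makes the result NaN.
mean_with_nan = years.mean(skipna=False)

print(mean_skipping_nan, mean_with_nan)
```

---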