CIP Project Create Impurities A.ipynb <br>
Author: Esin Handenur Isik

## Make data synthetically dirty

- Make approximately 15% of the data dirty.
- Use df.sample() to pick the affected rows at random
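
The sample-then-write-back idea can be sketched on a toy frame. This is a minimal illustration, not the project data: the frame `toy`, its values, and the fixed `random_state` are assumptions made only so the sketch is self-contained and repeatable.

```python
import pandas as pd

# Toy frame standing in for the scraped data (values are illustrative only).
toy = pd.DataFrame({
    "Title": list("abcdefghij"),
    "Year": [2000.0 + i for i in range(10)],
})

# Draw a random subset of rows; random_state is fixed only for repeatability.
subset = toy.sample(n=3, random_state=0).copy()

# Modify the subset, then write it back into the full frame.
# DataFrame.update() aligns on the index, so only the sampled rows change.
subset["Year"] = subset["Year"] + 0.01
toy.update(subset)

# Count the rows that now carry a fractional "month" part.
n_changed = int((toy["Year"] % 1 > 0).sum())
print(n_changed)
```

Because `update()` works in place and matches rows by index label, the sampled rows keep their position in the full frame.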

Already present impurities:
- Currency signs and commas should not be in the Budget and Revenue columns
- The datatypes of Budget and Revenue need to be changed
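
For orientation, removing these two pre-existing impurities could look roughly like the sketch below. The sample values (a leading "$" and comma thousands separators) are an assumption about how the scraped strings look, and the frame `raw` is hypothetical.

```python
import pandas as pd

# Hypothetical raw values; the real formatting of the scraped data may differ.
raw = pd.DataFrame({"Budget": ["$1,000,000", None, "$250,000"]})

# Drop everything that is not a digit or a decimal point
# (currency signs, commas, stray spaces), then convert to a numeric dtype.
stripped = raw["Budget"].str.replace(r"[^\d.]", "", regex=True)
raw["Budget"] = pd.to_numeric(stripped, errors="coerce")

print(raw["Budget"].tolist())
```

Missing values stay missing, and the column ends up as a float column, so numeric comparisons work without a try/except.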

Synthetically added impurities:
- Shuffle rows to make ranking incorrect
- Add inaccurate month info to random release years
- Replace some NaN values in the Budget column with a random, unrealistically small number (between 1 and 100)
- Add a "#" character to titles at a random position


In [None]:
import pandas as pd
import numpy as np

## Function to apply impurities: impurify()

In [None]:
def impurify(dataframe):
    """
    Impurify approx 15% of a dataset with the following actions:
    - Impurity 1: shuffle rows to make the ranking incorrect
    - Impurity 2: some years contain inaccurate month info
    - Impurity 3: some NaN in the Budget column are replaced by a random small number (between 1 and 100) instead of np.nan
    - Impurity 4: some titles contain a # at a random position
    :Return: Impurified dataset
    """
    
    from random import randint

    # Impurity 1: shuffle all rows of the dataframe. sample() returns a new
    # frame, so the input dataframe itself is not modified.
    dfimpure = dataframe.sample(frac=1)
    
    # Impurity 2: add inaccurate month info to the year (e.g. 1994 -> 1994.01)
    dfupdate = dfimpure.sample(50)
    dfupdate.Year += 0.01
    dfimpure.update(dfupdate)
    
    # Impurity 3: replace NaN in the Budget column with a random integer from 1 to 100
    # (sample(50) assumes at least 50 rows have a missing Budget)
    dfupdate = dfimpure.loc[dfimpure.Budget.isna()].sample(50)

    def add_wrongbudget(value):
        # The original (NaN) value is ignored and replaced outright.
        return randint(1, 100)

    dfupdate.Budget = dfupdate.Budget.apply(add_wrongbudget)
    dfimpure.update(dfupdate)
    
    # Impurity 4: insert a "#" into the Title at a random position
    dfupdate = dfimpure.sample(50)

    def addhashtag(value):
        # randint is inclusive on both ends, so the "#" may also land
        # before the first or after the last character.
        random_index = randint(0, len(value))
        return value[:random_index] + "#" + value[random_index:]

    dfupdate.Title = dfupdate.Title.apply(addhashtag)
    dfimpure.update(dfupdate)
    
    return dfimpure
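
The title impurity can be tried in isolation. The sketch below is a standalone variant of the inner helper, not the function used above: the name `add_hashtag`, the explicit `rng` parameter, the seed, and the sample title are all assumptions introduced so the behaviour can be repeated and checked.

```python
import random


def add_hashtag(title, rng):
    """Insert a single '#' at a random position from 0 to len(title), inclusive."""
    position = rng.randint(0, len(title))
    return title[:position] + "#" + title[position:]


# A seeded generator is assumed here only so that reruns give the same result.
rng = random.Random(42)
dirty = add_hashtag("Example Title", rng)
print(dirty)

# Removing the '#' again recovers the original title.
restored = dirty.replace("#", "")
```

Passing the generator in explicitly keeps the randomness controllable. Note that `df.sample()` uses its own random state (`random_state=`), separate from Python's `random` module.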

### Load scraped data:

In [None]:
df_stage1 = pd.read_csv("../Data/A_stage1.csv", delimiter = ",")

### Apply function to impurify data:

In [None]:
A_stage2 = impurify(df_stage1)

In [None]:
pd.set_option('display.max_rows', 1000)
A_stage2

### Check the impurities and length of data set:

In [None]:
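budget_count = 0
for budget in A_stage2.Budget:
    try:
        # randint(1, 100) is inclusive, so 100 itself is a possible injected value
        if budget <= 100:
            budget_count += 1
    except TypeError:
        # non-numeric budgets (e.g. strings) cannot be compared with a number
        continue

year_count = 0
for year in A_stage2.Year:
    try:
        # injected years look like 1994.01: the second decimal digit is 1
        dec = int(str(year).split(".")[1][1])
        if dec == 1:
            year_count += 1
    except (IndexError, ValueError):
        # no fractional part, or fewer than two decimal digits
        continue

print("Number of records with wrong date info: " + str(year_count))
print("Number of wrong NaN replacements: " + str(budget_count))
print("Number of record titles that contain a #: " + str(len(A_stage2.loc[A_stage2.Title.str.contains("#")])))
print("Length of data set: " + str(len(A_stage2)))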
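
The same three counts can also be computed without explicit loops. The following is a sketch on a small hypothetical frame (`check` and its values are assumptions, not project data). It treats a year as impure when its fractional part equals .01, which is slightly stricter than the digit test above.

```python
import numpy as np
import pandas as pd

# Hypothetical mini data set with one impurity of each kind.
check = pd.DataFrame({
    "Title": ["Al#pha", "Beta", "Gamma"],
    "Year": [1999.01, 2005.0, 2010.01],
    "Budget": ["$5,000", 42.0, np.nan],
})

# Budgets: coerce non-numeric entries to NaN, then count values up to 100.
budget_num = pd.to_numeric(check["Budget"], errors="coerce")
n_wrong_budget = int((budget_num <= 100).sum())

# Years: scale by 100 and round to avoid floating-point noise,
# then test whether the two decimal digits equal 01.
n_wrong_year = int(((check["Year"] * 100).round() % 100 == 1).sum())

# Titles: literal (non-regex) search for the hash character.
n_hash = int(check["Title"].str.contains("#", regex=False).sum())

print(n_wrong_year, n_wrong_budget, n_hash)
```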

### Export the dataframe: A_stage2.csv

In [None]:
# to_csv() returns None when given a path, so the result is not assigned back
A_stage2.to_csv("../Data/A_stage2.csv", index = False)