# **Introduction**

## **Some sharp Kagglers found a dataset that closely matches our competition's data!!**

[Click Here](https://archive.ics.uci.edu/ml/datasets/Air+Quality) **to check the original data!**

### With the UCI Air Quality dataset, we can look up what each column in the given data means!!!


- **Sensor Columns**

    * sensor_1 : Hourly Averaged Sensor Response (nominally CO targeted)
    * sensor_2 : Hourly Averaged Sensor Response (nominally NMHC targeted, not in this competition)
    * sensor_3 : Hourly Averaged Sensor Response (nominally NOx targeted)
    * sensor_4 : Hourly Averaged Sensor Response (nominally NO2 targeted)
    * sensor_5 : Hourly Averaged Sensor Response (nominally O3 targeted)


- **Target Columns**

    * target_carbon_monoxide : Hourly Averaged CO Concentration in mg/m^3
    * target_benzene : Hourly Averaged Benzene Concentration in microg/m^3
    * target_nitrogen_oxides : Hourly Averaged NOx Concentration in ppb
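
The column lists above amount to a renaming table between the two datasets. Here is a minimal sketch of that table as a dict, with the three weather columns (`T`, `RH`, `AH`) added as they are used later in this notebook. The helper name `rename_uci_columns` and the toy row are made up for illustration only.

```python
import pandas as pd

# UCI column name -> competition column name, following the lists above
UCI_TO_TPS = {
    'T': 'deg_C',
    'RH': 'relative_humidity',
    'AH': 'absolute_humidity',
    'PT08.S1(CO)': 'sensor_1',
    'PT08.S2(NMHC)': 'sensor_2',
    'PT08.S3(NOx)': 'sensor_3',
    'PT08.S4(NO2)': 'sensor_4',
    'PT08.S5(O3)': 'sensor_5',
    'CO(GT)': 'target_carbon_monoxide',
    'C6H6(GT)': 'target_benzene',
    'NOx(GT)': 'target_nitrogen_oxides',
}

def rename_uci_columns(uci_df):
    """Keep only the mapped UCI columns and rename them to the TPS names."""
    present = [c for c in UCI_TO_TPS if c in uci_df.columns]
    return uci_df[present].rename(columns=UCI_TO_TPS)

# Toy row with made-up values; 'NMHC(GT)' has no TPS counterpart and is dropped
toy = pd.DataFrame({'T': [13.6], 'CO(GT)': [2.6], 'NMHC(GT)': [150.0]})
renamed = rename_uci_columns(toy)
print(renamed.columns.tolist())
```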

# **EDA**

In [None]:
!pip install openpyxl

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import pandas as pd
import numpy as np

In [None]:
l_data = pd.read_excel('../input/air-quality-time-series-data-uci/AirQualityUCI.xlsx')

### **Making a new dataframe with the same columns as the TPS dataset**

**Preprocess the datetime columns!**

In [None]:
# Extract the hour from the 'Time' column in one pass
# (avoids the row-by-row loop and chained assignment)
l_data['hour'] = l_data['Time'].apply(lambda t: t.hour)

time_se = l_data['Date'].dt.date - l_data['Date'].dt.date.min()

leak = pd.DataFrame({
    'deg_C' : l_data['T'],
    'relative_humidity' : l_data['RH'],
    'absolute_humidity' : l_data['AH'],
    'sensor_1' : l_data['PT08.S1(CO)'],
    'sensor_2' : l_data['PT08.S2(NMHC)'],
    'sensor_3' : l_data['PT08.S3(NOx)'],
    'sensor_4' : l_data['PT08.S4(NO2)'],
    'sensor_5' : l_data['PT08.S5(O3)'],
    'target_carbon_monoxide' : l_data['CO(GT)'],
    'target_benzene' : l_data['C6H6(GT)'],
    'target_nitrogen_oxides' : l_data['NOx(GT)'],
    'year' : l_data['Date'].dt.year,
    'month' : l_data['Date'].dt.month,
    'week' : l_data['Date'].dt.isocalendar().week.astype(int),  # .dt.week is deprecated
    'day' : l_data['Date'].dt.day,
    'dayofweek' : l_data['Date'].dt.dayofweek,
    'time' : time_se,
    'hour' : l_data['hour'],
})
leak['time'] = leak['time'].apply(lambda x : x.days)

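# Rows from index 7110 onward are used as the counterpart of the competition test set
leak_sub = leak[7110:].reset_index(drop = True)
# Row positions in leak_sub where each target is missing (UCI tags missing values as -200)
carbon_out = leak_sub[leak_sub['target_carbon_monoxide'] == -200].index
benzene_out = leak_sub[leak_sub['target_benzene'] == -200].index
nitrogen_out = leak_sub[leak_sub['target_nitrogen_oxides'] == -200].index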

leak

In [None]:
tps_train = pd.read_csv('../input/tabular-playground-series-jul-2021/train.csv')
tps_test = pd.read_csv('../input/tabular-playground-series-jul-2021/test.csv')
tps_dataset = pd.concat([tps_train, tps_test]).reset_index(drop = True)

tps_dataset['date_time'] = pd.to_datetime(tps_dataset['date_time'])
tps_dataset['year'] = tps_dataset['date_time'].dt.year
tps_dataset['month'] = tps_dataset['date_time'].dt.month
tps_dataset['week'] = tps_dataset['date_time'].dt.isocalendar().week.astype(int)  # .dt.week is deprecated
tps_dataset['day'] = tps_dataset['date_time'].dt.day
tps_dataset['dayofweek'] = tps_dataset['date_time'].dt.dayofweek
tps_dataset['time'] = tps_dataset['date_time'].dt.date - tps_dataset['date_time'].dt.date.min()
tps_dataset['hour'] = tps_dataset['date_time'].dt.hour
tps_dataset['time'] = tps_dataset['time'].apply(lambda x : x.days)

tps_dataset.drop(columns = 'date_time', inplace = True)
tps_dataset

### **Comparing the two tables above, it looks like Kaggle only changed the year in the UCI dataset and reused it for this competition...**
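
One way to put a number on this claim is to align the two tables on keys that do not depend on the year and count how often a column agrees. Below is a minimal sketch, assuming both frames carry the `time` (days since the first date) and `hour` columns built above. The helper `match_rate`, the tolerance `atol` and the toy values are assumptions for illustration, not results from the real data.

```python
import numpy as np
import pandas as pd

def match_rate(tps_df, uci_df, column, keys=('time', 'hour'), atol=0.1):
    """Fraction of key-aligned rows whose `column` values agree within `atol`."""
    merged = tps_df.merge(uci_df, on=list(keys), suffixes=('_tps', '_uci'))
    if merged.empty:
        return float('nan')
    close = np.isclose(merged[column + '_tps'], merged[column + '_uci'], atol=atol)
    return float(close.mean())

# Toy frames with made-up values: two rows agree, one does not
toy_tps = pd.DataFrame({'time': [0, 0, 1], 'hour': [18, 19, 0],
                        'sensor_1': [1360.0, 1292.3, 1100.0]})
toy_uci = pd.DataFrame({'time': [0, 0, 1], 'hour': [18, 19, 0],
                        'sensor_1': [1360.0, 1292.25, 900.0]})
rate = match_rate(toy_tps, toy_uci, 'sensor_1')
print(round(rate, 3))
```

On the real data this would be called as `match_rate(tps_dataset, leak, 'sensor_1')`; a rate close to 1 would support the claim, while -200 placeholders on the UCI side would pull it down.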


### **We can check this by comparing the distributions visually**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
def Comparison_Dist_Plot(targets):
    """Plot the distribution of each column for TPS (left) and UCI (right)."""
    NUM_COLS = len(targets)
    # squeeze = False keeps `ax` 2-D even when only one column is passed
    fig, ax = plt.subplots(NUM_COLS, 2, figsize = (10, NUM_COLS * 3), squeeze = False)
    for n, col in enumerate(targets):
        # histplot with a KDE curve replaces the deprecated sns.distplot
        sns.histplot(tps_dataset[col], kde = True, stat = 'density', ax = ax[n, 0], color = 'red')
        ax[n, 0].set_title('TPS', fontsize = 15)
        sns.histplot(leak[col], kde = True, stat = 'density', ax = ax[n, 1], color = 'violet')
        ax[n, 1].set_title('UCI', fontsize = 15)
    # Lay out and show the figure for any number of columns
    plt.tight_layout()
    plt.show()

In [None]:
Comparison_Dist_Plot(tps_dataset.columns[:11])

### **We can see outliers in the UCI dataset: missing values are recorded as -200**

### **Let's handle them!**
### **Replace the -200 values with LGBM predictions of the real values**
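
Before imputing, it can help to see how many readings are affected. The sketch below counts the -200 placeholders per column and shows the alternative of turning them into `NaN`. The helper names and the toy frame are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

MISSING_SENTINEL = -200  # placeholder used for missing readings, as noted above

def count_sentinel(df, sentinel=MISSING_SENTINEL):
    """Number of placeholder values in each column."""
    return (df == sentinel).sum()

def sentinel_to_nan(df, sentinel=MISSING_SENTINEL):
    """Copy of `df` with the placeholder replaced by NaN."""
    return df.replace(sentinel, np.nan)

# Toy frame with made-up values
toy = pd.DataFrame({
    'target_benzene': [11.9, -200.0, 9.4],
    'sensor_1': [1360.0, 1292.0, -200.0],
})
counts = count_sentinel(toy)
cleaned = sentinel_to_nan(toy)
print(counts)
```

On the real data, `count_sentinel(leak)` would show which columns need the most repair.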

In [None]:
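# Outliers Preprocessing: replace the -200 placeholders with LGBM predictions

from lightgbm import LGBMRegressor

def Outliers(targets):
    for target in targets:
        # Rows of the UCI data where this column holds the -200 placeholder
        out = leak[leak[target] == -200].index
        if len(out) == 0:
            continue  # nothing to impute for this column

        # Data Preparing: train only on TPS rows where this column is known
        # (the test part of tps_dataset has no target values)
        known = tps_dataset[target].notna()
        X = tps_dataset.loc[known].drop(columns = [target, 'year'])
        y = tps_dataset.loc[known, target]
        test = leak.loc[out].drop(columns = [target, 'year'])

        # Modeling (verbose = -1 silences LightGBM's training log)
        lgbm = LGBMRegressor(learning_rate = 0.1, n_estimators = 1000, verbose = -1)
        lgbm.fit(X, y)
        pred = lgbm.predict(test)

        leak.loc[out, target] = pred
    print('done!')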

In [None]:
Outlier_Target = tps_dataset.columns[:11]
Outliers(Outlier_Target)

In [None]:
Comparison_Dist_Plot(tps_dataset.columns[:11])

## **NOW we can confirm that the two datasets are similar to each other!!**

## More Visualization??

In [None]:
fig, ax = plt.subplots(1, 2, figsize = (20, 10))
sns.heatmap(tps_dataset.corr(), ax = ax[0])
ax[0].set_title('TPS', fontsize = 30)
sns.heatmap(leak.corr(), ax = ax[1])
ax[1].set_title('UCI', fontsize = 30)
plt.tight_layout()
plt.show()

## **Let's make sub.csv !!**

### **Replacing -200 values with the best submission's values**

In [None]:
best_sub = pd.read_csv('../input/dasdas/sub (44).csv')
best_sub

In [None]:
leak_sub

In [None]:
leak_sub.loc[carbon_out, 'target_carbon_monoxide'] = best_sub.loc[carbon_out, 'target_carbon_monoxide']
leak_sub.loc[benzene_out, 'target_benzene'] = best_sub.loc[benzene_out, 'target_benzene']
leak_sub.loc[nitrogen_out, 'target_nitrogen_oxides'] = best_sub.loc[nitrogen_out, 'target_nitrogen_oxides']
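
The same fallback can be written without precomputed index lists. Here is a minimal sketch using `Series.where`, assuming the two series are aligned row by row. The helper `fill_sentinel_from` and the toy values are made up for illustration.

```python
import pandas as pd

def fill_sentinel_from(primary, fallback, sentinel=-200):
    """Keep `primary` values, but take `fallback` wherever `primary` equals `sentinel`."""
    return primary.where(primary != sentinel, fallback)

# Toy series with made-up values: the middle reading is missing
toy_leak = pd.Series([2.1, -200.0, 0.8])
toy_best = pd.Series([2.0, 1.5, 0.9])
filled = fill_sentinel_from(toy_leak, toy_best)
print(filled.tolist())
```

For one target this would read `fill_sentinel_from(leak_sub['target_benzene'], best_sub['target_benzene'])`.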

In [None]:
sub = pd.read_csv('../input/tabular-playground-series-jul-2021/sample_submission.csv')
sub['target_carbon_monoxide'] = leak_sub['target_carbon_monoxide']
sub['target_benzene'] = leak_sub['target_benzene']
sub['target_nitrogen_oxides'] = leak_sub['target_nitrogen_oxides']
sub

In [None]:
sub.to_csv('sub.csv', index = False)