# **Notice**

https://www.kaggle.com/songwonmin/for-korean-tabular-aug-2021

**This notebook was translated with a machine translator, so some parts may be hard to understand.**

# **Before we begin**

**The Tabular Playground Series is a set of competitions aimed at beginners in data analysis, giving them practice in handling data and making predictions.**

**I also entered a tabular competition as a beginner, tried many approaches, and reached 14th place on the leaderboard of the May competition.**

**Along the way, I felt that people learning data analysis for the first time needed a guide to working with the data and making a submission.**

**This notebook covers only LightGBM and random forest, which are frequently used on Kaggle.**

**As I'm a beginner, there are probably a lot of parts that make you go "Huh?". Questions and comments are welcome.**

-------

**Because of how the tabular competition data is built, the features have little to do with data preprocessing, so I recommend looking at other people's notebooks for the EDA part.**

**Data preprocessing is critical in real-world data analysis, so I recommend participating in another competition to learn how to preprocess data.**

----------

**The train dataset is used for learning. It contains the value of each feature together with the actual target value.**

**The test dataset is used for prediction. You must predict the target value from its features.**

**sample_submission shows the format that a submission must follow.**

----------

# **Library import**

**This step imports the libraries that will be used.**

**First, import numpy, which helps with mathematical operations, and pandas, which helps with data processing.**

In [None]:
import numpy as np
import pandas as pd

**To keep unnecessary warnings from cluttering the output, import the warnings module and tell it to ignore them.**

In [None]:
import warnings
warnings.filterwarnings('ignore')

# **Data load**

**Read each csv file into a pandas dataframe so the data can be viewed and processed.**

**Pass a file path to pandas' read_csv to load the data.**

**The path can be obtained easily by copying it from the data listed under input.**

In [None]:
train = pd.read_csv("../input/tabular-playground-series-aug-2021/train.csv")
#Read "train.csv" into a dataframe and store it in the train variable.

In [None]:
test = pd.read_csv("../input/tabular-playground-series-aug-2021/test.csv")
#Read "test.csv" into a dataframe and store it in the test variable.

In [None]:
sample = pd.read_csv("../input/tabular-playground-series-aug-2021/sample_submission.csv")
#Read "sample_submission.csv" into a dataframe and store it in the sample variable.
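
**If copying a path by hand is inconvenient, the files can also be listed from code. This is a minimal sketch using only the standard library. `list_files` is a hypothetical helper, and the temporary directory merely stands in for the `../input` folder.**

```python
import os
import tempfile


def list_files(root):
    """Return the sorted paths of every file found under `root`."""
    paths = []
    for dirname, _, filenames in os.walk(root):
        for filename in filenames:
            paths.append(os.path.join(dirname, filename))
    return sorted(paths)


# Demo on a throwaway directory (assumption: it stands in for "../input").
with tempfile.TemporaryDirectory() as demo_root:
    for name in ("train.csv", "test.csv"):
        open(os.path.join(demo_root, name), "w").close()
    found = list_files(demo_root)
    print([os.path.basename(p) for p in found])  # ['test.csv', 'train.csv']
```

**In a Kaggle notebook, calling `list_files("../input")` should print paths that can be pasted into read_csv.**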

# **Viewing data**

**Use .head() and .shape, which pandas dataframes provide, to see what the data looks like.**

In [None]:
train.shape
#The training data consists of 250,000 rows and 102 columns.

In [None]:
test.shape
#The test data has 150,000 rows and 101 columns; it lacks the loss column, which is the value to be predicted.

In [None]:
train.head(10)
#.head() shows rows from the top of a pandas dataframe; here, the first 10.

In [None]:
test.head()
#Called with no argument, head() shows the first five rows.

In [None]:
sample.head()
#This shows the format of the data that must be submitted.
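
**Besides .head() and .shape, a few other dataframe methods give a quick overview. The sketch below runs on a small made-up dataframe (`toy` and its values are assumptions, not competition data) and shows column types, missing-value counts, and summary statistics.**

```python
import numpy as np
import pandas as pd

# Toy stand-in for the competition data (hypothetical values).
toy = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "f0": [0.5, np.nan, 1.5, 2.0],
    "loss": [1, 0, 3, 2],
})

print(toy.dtypes)            # data type of each column
missing = toy.isna().sum()   # number of missing values per column
print(missing)               # f0 has one missing value here
print(toy.describe())        # count, mean, std, min, quartiles, max
```

**The same calls should work unchanged on train and test.**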

# **Machine Learning Model Description**

**This section describes how to split the training data and validate the model on a validation dataset.**

**Rather than implementing the models ourselves, we import and use RandomForestRegressor and LGBMRegressor.**

**But to get good performance, we need to understand how the models work.**

In [None]:
X = train.drop(["loss","id"],axis=1)
y = train["loss"]
#First, split the training data into the features used for learning and the target value.

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_val,y_train,y_val = train_test_split(X,y,test_size=0.2,random_state=38)
#Create a validation dataset to check for problems such as overfitting and underfitting.
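
**To see what train_test_split does with test_size=0.2, here is a small sketch on made-up arrays (the `_demo` names and sizes are assumptions). 20% of the rows go to the validation set, and fixing random_state makes the split repeatable.**

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 50 made-up rows with 2 features each.
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.arange(50)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=38
)
print(X_tr.shape, X_va.shape)  # (40, 2) (10, 2)

# The same random_state gives the same split again.
X_tr2, X_va2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=38
)
print(np.array_equal(X_va, X_va2))  # True
```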

In [None]:
from sklearn.ensemble import RandomForestRegressor
#Train a random forest with its default hyperparameters.
random_forest = RandomForestRegressor()
random_forest.fit(X_train,y_train)

In [None]:
from lightgbm import LGBMRegressor
#Train a LightGBM model with its default hyperparameters.
lightgbm = LGBMRegressor()
lightgbm.fit(X_train,y_train)
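
**As one small step toward understanding how a model works, the sketch below checks that a random forest regressor's prediction is the average of its individual trees' predictions. It uses synthetic data from make_regression, and the sizes and seeds are assumptions chosen only to keep it fast.**

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Small synthetic regression problem (not competition data).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=0.1, random_state=0
)

forest = RandomForestRegressor(n_estimators=10, random_state=0)
forest.fit(X_demo, y_demo)

# Predict with every fitted tree, then average across trees by hand.
tree_preds = np.stack([tree.predict(X_demo) for tree in forest.estimators_])
manual_average = tree_preds.mean(axis=0)

matches = bool(np.allclose(manual_average, forest.predict(X_demo)))
print(matches)  # True
```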

# **Model Evaluation**

**Use the validation dataset to evaluate the performance of the model.**

**The metric is root mean squared error (RMSE), the same one used in the actual competition.**

In [None]:
from sklearn.metrics import mean_squared_error
#Import the mean_squared_error metric from sklearn.

In [None]:
from math import sqrt

def rmse(y_true, y_pred):
    #RMSE is the square root of the mean squared error.
    return sqrt(mean_squared_error(y_true, y_pred))
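
**To make the metric concrete, here is a small worked example on made-up numbers. It computes RMSE by hand with numpy and compares it with the square root of sklearn's mean_squared_error. The arrays are assumptions for illustration only.**

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true_demo = np.array([3.0, 5.0, 2.0, 7.0])
y_pred_demo = np.array([2.0, 5.0, 4.0, 7.0])

# Errors are 1, 0, -2, 0, so squared errors are 1, 0, 4, 0 and their mean is 1.25.
manual_rmse = np.sqrt(np.mean((y_true_demo - y_pred_demo) ** 2))
sklearn_rmse = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))

print(manual_rmse)                           # 1.118033988749895
print(np.isclose(manual_rmse, sklearn_rmse)) # True
```

**Because the errors are squared, swapping the two arguments gives the same value.**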

**Pass the predictions and actual values to the function and print the resulting score.**

In [None]:
random_forest_pred=random_forest.predict(X_val)
random_forest_score = rmse(y_val, random_forest_pred)
print("random forest score:",random_forest_score)

In [None]:
lightgbm_pred=lightgbm.predict(X_val)
lightgbm_score = rmse(y_val, lightgbm_pred)
print("lightgbm score:",lightgbm_score)
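
**The validation score is most useful next to the training score. A model that scores far better on the data it was trained on than on the validation data is probably overfitting. The sketch below shows this on synthetic data with an unrestricted decision tree. The data, sizes, and choice of model are assumptions for illustration.**

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Noisy synthetic data (not competition data).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, noise=20.0, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=38
)

# A tree with no depth limit can memorize its training rows.
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

train_rmse = np.sqrt(mean_squared_error(y_tr, tree.predict(X_tr)))
val_rmse = np.sqrt(mean_squared_error(y_va, tree.predict(X_va)))
print("train rmse:", train_rmse)
print("validation rmse:", val_rmse)
```

**A large gap between the two numbers is the sign to look for. The same comparison can be made for random_forest and lightgbm by also predicting on X_train.**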

# **Tuning some hyperparameters for better performance**

**There are many ways to maximize a model's performance.**

**Typically, you start by tuning the model's hyperparameters to find the values that fit your data.**

In [None]:
from sklearn.model_selection import GridSearchCV

**GridSearchCV tries every combination of the specified parameter values and finds the best one.**

In [None]:
from lightgbm import LGBMRegressor
lgbm = LGBMRegressor()

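#Candidate learning rates to try.
lgbm_param = {"learning_rate" : [0.1, 0.038,0.003, 0.001]}
#With no scoring argument, candidates are ranked by the model's default score (R^2 for regressors),
#using 5-fold cross-validation in current scikit-learn versions.
lgbm_grid_search = GridSearchCV(lgbm,param_grid=lgbm_param)

lgbm_grid_search.fit(X_train,y_train)
#predict() uses the best candidate, refit on all of X_train.
y_pred=lgbm_grid_search.predict(X_val)
lgbm_score = rmse(y_val, y_pred)
print(lgbm_score)
print(lgbm_grid_search.best_params_)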
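
**The sketch below shows two things about grid search on a small synthetic problem: how quickly the number of combinations grows, and how to rank candidates by RMSE rather than the default score. Ridge, the grid values, and cv=3 are assumptions chosen to keep the example fast. They are not the settings used in this notebook.**

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Small synthetic regression problem (not competition data).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

param_grid = {"alpha": [0.1, 1.0, 10.0], "fit_intercept": [True, False]}

# Every combination is tried: 3 alphas x 2 intercept settings.
n_candidates = len(ParameterGrid(param_grid))
print(n_candidates)  # 6

# scikit-learn maximizes scores, so RMSE is exposed as a negated value.
search = GridSearchCV(
    Ridge(),
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print("cross-validated rmse:", -search.best_score_)
```

**Each candidate is fitted once per fold, so this small grid already costs 18 fits plus one final refit. This is why large grids become slow on a dataset of this size.**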

**To keep things easy to understand, only one simple parameter was tuned.**

**There are many ways to tune hyperparameters, so I encourage you to look them up.**

**I hope you find a good model by adjusting various parameters.**

# **Make Submission**

**Use the id column of sample_submission and the predictions for the test data to write the result to a csv file.**

**The to_csv function in pandas writes a dataframe out as a csv file.**

In [None]:
ids = sample["id"]
#Use the id value of sample_submission

In [None]:
preds = random_forest.predict(test.drop('id', axis=1))
output = pd.DataFrame({"id":ids, "loss":preds})
output.to_csv("random_forest.csv", index=False)

In [None]:
preds = lightgbm.predict(test.drop('id', axis=1))
output = pd.DataFrame({"id":ids, "loss":preds})
output.to_csv("lightgbm.csv", index=False)

In [None]:
preds = lgbm_grid_search.predict(test.drop('id', axis=1))
output = pd.DataFrame({"id":ids, "loss":preds})
output.to_csv("lgbm_grid_search.csv", index=False)
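
**Before uploading, it can help to confirm that a submission has the same layout as sample_submission. The sketch below is one possible check. `check_submission` is a hypothetical helper, and the tiny dataframes are made-up stand-ins for sample and output.**

```python
import pandas as pd


def check_submission(output, sample):
    """Return a list of problems found; an empty list means the layout matches."""
    problems = []
    if list(output.columns) != list(sample.columns):
        problems.append("columns differ from sample_submission")
    if len(output) != len(sample):
        problems.append("row count differs from sample_submission")
    elif (output["id"].to_numpy() != sample["id"].to_numpy()).any():
        problems.append("ids differ from sample_submission")
    if output.isna().any().any():
        problems.append("contains missing values")
    return problems


# Made-up stand-ins for the real dataframes.
sample_demo = pd.DataFrame({"id": [0, 1, 2], "loss": [0, 0, 0]})
good_demo = pd.DataFrame({"id": [0, 1, 2], "loss": [1.5, 2.0, 0.5]})
bad_demo = pd.DataFrame({"id": [0, 1, 2], "loss": [1.5, float("nan"), 0.5]})

print(check_submission(good_demo, sample_demo))  # []
print(check_submission(bad_demo, sample_demo))   # ['contains missing values']
```

**With the real data, the call would be `check_submission(output, sample)`.**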

# **How to increase the performance of your model**

**In addition to tuning the model's hyperparameters, there are many ways to improve the model's performance.**

**1. Perform feature engineering.**

**2. Combine multiple models, for example with a model ensemble, to get better performance.**

**3. Use methods such as pseudo labeling to make use of more data.**
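
**As an illustration of the second idea, the simplest ensemble is an average of two models' predictions. The sketch below uses synthetic data and two scikit-learn models. The models, the equal 0.5 weights, and the `_demo` data are all assumptions, not a recommended recipe.**

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem (not competition data).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, noise=20.0, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=38
)

pred_a = Ridge().fit(X_tr, y_tr).predict(X_va)
pred_b = (
    RandomForestRegressor(n_estimators=50, random_state=0)
    .fit(X_tr, y_tr)
    .predict(X_va)
)

# Equal-weight blend of the two prediction vectors.
blend = 0.5 * pred_a + 0.5 * pred_b

scores = {}
for name, pred in [("ridge", pred_a), ("forest", pred_b), ("blend", blend)]:
    scores[name] = np.sqrt(mean_squared_error(y_va, pred))
    print(name, scores[name])
```

**An equal-weight blend can never score worse than the weaker of its two models, and it often helps most when the models make different kinds of errors. In this notebook, the same idea would mean averaging the random_forest and lightgbm predictions before writing the csv.**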

**I hope you get good scores using various methods.**