# Salary Prediction with Linear Regression Models on the Hitters Data Set

This project uses linear regression models to predict salaries. It starts with some background on the Hitters data set and on the linear regression models used. The goal is to build a machine learning model that predicts the salaries of baseball players from data on their performance. The data set contains Major League Baseball data from the 1986 and 1987 seasons, with one record per player.

Linear Regression Models:

* Simple Linear Regression
* Multiple Linear Regression
* Ridge Regression
* Lasso Regression
* ElasticNet Regression
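The models listed above differ mainly in the penalty added to the least-squares loss. As a rough summary, these are the objectives scikit-learn documents for its `LinearRegression`, `Ridge`, `Lasso` and `ElasticNet` estimators, with the intercept left out for brevity. Here $n$ is the number of samples, $\alpha$ is the regularization strength and $\rho$ stands for `l1_ratio`; the symbol names are chosen for this note only.

```latex
\begin{aligned}
\text{Linear regression:} \quad & \min_{w} \; \lVert y - Xw \rVert_2^2 \\
\text{Ridge:} \quad & \min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2 \\
\text{Lasso:} \quad & \min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1 \\
\text{ElasticNet:} \quad & \min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
\end{aligned}
```

With $\rho = 1$ the ElasticNet objective reduces to the Lasso objective. Simple linear regression is the special case of linear regression with a single explanatory variable.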

Dependent Variable:
Salary: the player's 1987 annual salary on opening day

Explanatory Variables:

Description of the variables in the Hitters data set

* AtBat: Number of times at bat in 1986
* Hits: Number of hits in 1986
* HmRun: Number of home runs in 1986
* Runs: Number of runs the player scored in 1986
* RBI: Number of runs batted in in 1986, i.e. runs that scored as a result of the batter's plate appearances
* Walks: Number of walks in 1986
* Years: Number of years the player has played in the major leagues
* CAtBat: Number of times at bat during the player's career
* CHits: Number of hits during the player's career
* CHmRun: Number of home runs during the player's career
* CRuns: Number of runs scored during the player's career
* CRBI: Number of runs batted in during the player's career
* CWalks: Number of walks during the player's career
* League: A factor with levels A and N indicating the player's league at the end of 1986
* Division: A factor with levels E and W indicating the player's division at the end of 1986
* PutOuts: Number of put outs in 1986
* Assists: Number of assists in 1986
* Errors: Number of errors in 1986
* Salary: 1987 annual salary on opening day (in thousands of dollars)
* NewLeague: A factor with levels A and N indicating the player's league at the beginning of 1987

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


import warnings
warnings.filterwarnings("ignore")

from sklearn.model_selection import train_test_split, cross_val_score

In [None]:
df = pd.read_csv('../input/hitters/Hitters.csv')

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.tail()

In [None]:
df.columns

In [None]:
df.sort_values('Salary',ascending=False)

In [None]:
df.isnull().sum()

In [None]:
df['Experience'] = pd.cut(df['Years'],4)

pd.cut(df['Years'],4).value_counts()

In [None]:
df['Experience'] = pd.cut(df['Years'],[0,5,10,15,25],labels=[1,2,3,4])
df.groupby(['League','Division', 'Experience']).agg({'Salary':'mean'})

In [None]:
df['Salary'] = df['Salary'].fillna(df.groupby(['League','Division', 'Experience'])['Salary'].transform('mean'))
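The imputation step above fills each missing salary with the mean salary of the player's group. A minimal sketch on toy data shows how `transform("mean")` lines the group means up with the original rows. The values and the single grouping column are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): two groups, one missing salary in each.
toy = pd.DataFrame({
    "League": ["A", "A", "A", "N", "N", "N"],
    "Salary": [100.0, 300.0, np.nan, 50.0, np.nan, 150.0],
})

# transform("mean") returns a Series aligned with the original rows that
# holds each row's group mean; NaN values are skipped when averaging.
group_mean = toy.groupby("League")["Salary"].transform("mean")

# fillna only touches the missing entries; observed salaries stay unchanged.
toy["Salary_filled"] = toy["Salary"].fillna(group_mean)
print(toy)
```

If an entire group has no observed salary, its mean is NaN and those rows stay missing, so it may be worth re-checking `isnull().sum()` after this step.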

In [None]:
df.describe([0.01, 0.05,0.10,0.25,0.50,0.75,0.90,0.95,0.99]).T

In [None]:
df.shape

In [None]:
df.head()

In [None]:
 

num_features = df.select_dtypes(['int64']).columns

# Count the values outside the 1.5*IQR fences for each integer feature
for feature in num_features:

    Q1 = df[feature].quantile(0.25)
    Q3 = df[feature].quantile(0.75)

    IQR = Q3-Q1

    upper = Q3 + 1.5*IQR
    lower = Q1 - 1.5*IQR

    n_outliers = ((df[feature] > upper) | (df[feature] < lower)).sum()
    print(feature, " : " + str(n_outliers))
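The loop above applies the usual 1.5*IQR rule to every integer column. A small sketch on a made-up array shows the arithmetic behind the fences. The numbers are illustrative only and do not come from the Hitters data.

```python
import numpy as np

# Hypothetical sample with one obviously extreme value.
values = np.array([10, 12, 13, 14, 15, 16, 18, 95])

# Quartiles with NumPy's default linear interpolation.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences: anything beyond 1.5 * IQR from the quartiles is flagged.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(q1, q3, lower, upper, outliers)
```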
        
  

In [None]:
df.shape

In [None]:
from sklearn.neighbors import LocalOutlierFactor

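clf=LocalOutlierFactor(n_neighbors=20, contamination=0.1)
clf.fit_predict(df[num_features])
# Keep the scores in row order so they can be used as a mask on df later
df_scores=clf.negative_outlier_factor_
# Inspect the 20 most negative (most outlying) scores
np.sort(df_scores)[0:20]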

In [None]:
sns.boxplot(x=df_scores);

In [None]:
threshold=np.sort(df_scores)[7]
print(threshold)
df = df.loc[df_scores > threshold]
df = df.reset_index(drop=True)
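A boolean mask built from LOF scores only selects the intended rows if the scores are still in the same order as the rows of the DataFrame, that is, unsorted. The sketch below illustrates this on synthetic data with one planted outlier. The data, the column names and the choice of `n_neighbors=10` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

# Synthetic 2-D points with a planted outlier in row 0 (illustrative data).
rng = np.random.default_rng(0)
points = rng.normal(size=(50, 2))
points[0] = [8.0, 8.0]
toy = pd.DataFrame(points, columns=["x1", "x2"])

lof = LocalOutlierFactor(n_neighbors=10)
lof.fit_predict(toy)

# negative_outlier_factor_ is in row order: scores[i] belongs to toy.iloc[i].
scores = lof.negative_outlier_factor_

# Sort only a copy to pick the threshold; the mask uses the unsorted scores.
threshold = np.sort(scores)[0]
filtered = toy.loc[scores > threshold].reset_index(drop=True)
print(len(toy), len(filtered))
```

Sorting `scores` in place before building the mask would pair each score with the wrong row, so the filter would drop arbitrary rows rather than the most outlying ones.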

In [None]:
df.shape

In [None]:
df.info()

In [None]:
cat_features = ['League','Division','NewLeague'] 
num_features = list(df.select_dtypes(['int64']).columns)

In [None]:
cat_features

In [None]:
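# numeric_only=True skips the string and categorical columns (required in pandas >= 2.0)
corr = df.corr(numeric_only=True)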

In [None]:
fig, ax = plt.subplots(figsize=(15,15))
sns.heatmap(corr,annot=True)
plt.show()

In [None]:
for col in num_features:
    #sns.scatterplot(x=col ,y='Salary',data=df,hue='League')
    sns.jointplot(x =col, y = 'Salary', data = df, kind = "reg")
    plt.show()

In [None]:
df.groupby('League').mean(numeric_only=True).T

In [None]:
df.groupby('Division').mean(numeric_only=True).T

In [None]:
df.groupby('NewLeague').mean(numeric_only=True).T

In [None]:
for col in cat_features:
    print('Exploring {} feature'.format(col.upper()))
    print(df[col].value_counts(normalize=True,ascending=False))
    sns.barplot(x=col, y="Salary", data=df)
    plt.show()

In [None]:
sns.scatterplot(x=df['CHits']/df['Hits'] ,y='Salary',data=df,hue='League')
plt.show()

In [None]:
df.head()

In [None]:
df['Experience'] = pd.cut(df['Years'],[0,5,10,15,25],labels=[1,2,3,4])

In [None]:
df['Experience'].value_counts()

In [None]:
df.head()

In [None]:
sns.lineplot(x='Experience', y="Salary", data=df, estimator=np.mean)

In [None]:
df['CRBI_bins'] = pd.cut(df['CRBI'],4,labels=[1,2,3,4])

In [None]:
cat_features.extend(['Experience','CRBI_bins'])

In [None]:
cat_features

In [None]:
df.info()

In [None]:
df['New_HitRate']=df["CAtBat"]/df["CHits"]
df['New_AtBat']=df["CAtBat"]/df["AtBat"]
df['New_RBI']=df["CRBI"]/df["RBI"]
df['New_Walks']=df["CWalks"]/df["Walks"]
df['New_Hits']=df["CHits"]/df["Hits"]
df['New_HmRun']=df["CHmRun"]/df["HmRun"]
df['New_Runs']=df["CRuns"]/df["Runs"]
df['New_ChmrunRate']=df["CHmRun"]/df["CHits"]
df['New_Cat']=df["CAtBat"]/df["CRuns"]
df['New_Assist']=df["Hits"]/df["Assists"]
 

In [None]:
num_features.extend(['New_HitRate','New_RBI','New_Walks','New_Hits','New_HmRun','New_Runs','New_ChmrunRate','New_AtBat','New_Cat','New_Assist'])

In [None]:
from sklearn.preprocessing import LabelEncoder,OneHotEncoder

In [None]:
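# dtype=int keeps the dummy columns numeric; pandas >= 2.0 would otherwise create bool columns
df = pd.get_dummies(df, columns = cat_features, drop_first = True, dtype = int)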

In [None]:
df.head()

In [None]:
df = df.replace([np.inf, -np.inf], np.nan)
df = df.dropna()
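The cleanup above is needed because the ratio features divide by season totals that can be zero. A short sketch with made-up numbers shows what pandas produces in that case and what the replace-then-drop step removes.

```python
import numpy as np
import pandas as pd

# Hypothetical career and season home-run counts for three players.
toy = pd.DataFrame({"CHmRun": [10, 5, 0], "HmRun": [2, 0, 0]})

# Division by zero does not raise: x/0 gives inf and 0/0 gives NaN.
ratio = toy["CHmRun"] / toy["HmRun"]

# Turn infinities into NaN so that dropna() removes both kinds of bad value.
cleaned = ratio.replace([np.inf, -np.inf], np.nan).dropna()
print(ratio.tolist(), cleaned.tolist())
```

Every player with a zero in one of the denominators is dropped by this step, so comparing `df.shape` before and after shows how many rows it costs.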

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

std_scaler = StandardScaler()
min_max_scaler = MinMaxScaler()

df[num_features] = std_scaler.fit_transform(df[num_features])
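One caveat about the scaling above: the scaler is fit on the full data set before the train/test split, so statistics from the test rows influence the training features. A common alternative is to put the scaler in a pipeline so that it is refit on the training part of every fold. The sketch below uses synthetic data, and all names and values in it are illustrative assumptions rather than part of this notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(46)
X_demo = rng.normal(loc=100, scale=20, size=(120, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=5, size=120)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=46
)

# The scaler lives inside the pipeline, so each CV fold fits it on its own
# training part and only applies it to the held-out part.
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
neg_mse = cross_val_score(pipe, X_tr, y_tr, cv=5, scoring="neg_mean_squared_error")
cv_rmse = np.sqrt(-neg_mse.mean())

# Final fit on the training set only, then evaluation on the untouched test set.
pipe.fit(X_tr, y_tr)
test_rmse = np.sqrt(np.mean((y_te - pipe.predict(X_te)) ** 2))
print(cv_rmse, test_rmse)
```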

In [None]:
df.head()

In [None]:
y = df["Salary"]
X = df.drop('Salary', axis=1)

In [None]:
# RFECV: recursive feature elimination with cross-validated selection of the number of features
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestRegressor

def select_features(X,y):
    # Keep only numeric columns and drop any column that contains NaN
    X = X.select_dtypes([np.number]).dropna(axis=1)
    
    clf = RandomForestRegressor(random_state=46)
    clf.fit(X, y)
    
    selector = RFECV(clf,cv=10)
    selector.fit(X, y)
    
    features = pd.DataFrame()
    features['Feature'] = X.columns
    features['Importance'] = clf.feature_importances_
    features.sort_values(by=['Importance'], ascending=False, inplace=True)
    features.set_index('Feature', inplace=True)
    features.plot(kind='bar', figsize=(12, 5))
    
    
    best_columns = list(X.columns[selector.support_])
    print("Best Columns \n"+"-"*12+"\n{}\n".format(best_columns))
    
    return best_columns

In [None]:
best_features = select_features(X,y)
best_features

In [None]:
X.head()

Model Training

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.20,random_state=46)

In [None]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.metrics import mean_squared_error,r2_score
from sklearn.model_selection import GridSearchCV

Linear Regression

In [None]:
lr_model = LinearRegression()
lr_model

In [None]:
lr_model.fit(X_train, y_train)
y_pred = lr_model.predict(X_test)

In [None]:
lr_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
lr_rmse

In [None]:
lr_cv_rmse =  np.sqrt(np.mean(-cross_val_score(lr_model, X_train, y_train, cv = 10, scoring = "neg_mean_squared_error")))
lr_cv_rmse

In [None]:
np.sqrt(-cross_val_score(lr_model, X_train, y_train, cv = 10, scoring = "neg_mean_squared_error"))
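The two cells above summarize the same cross-validation scores in different ways: one takes the square root of the mean fold MSE, the other shows one RMSE per fold. The two summaries are not interchangeable, as a quick check with made-up fold errors shows.

```python
import numpy as np

# Hypothetical per-fold MSE values (not results from this notebook).
fold_mse = np.array([100.0, 400.0, 900.0])

# Square root of the average MSE, as in the single-number CV RMSE above.
rmse_of_mean = np.sqrt(fold_mse.mean())

# Average of the per-fold RMSE values.
mean_of_rmse = np.sqrt(fold_mse).mean()

print(rmse_of_mean, mean_of_rmse)
```

Because the square root is concave, the root of the mean MSE is never smaller than the mean of the per-fold RMSEs, so it helps to use the same summary for every model being compared.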

In [None]:
coefs = pd.DataFrame(lr_model.coef_, index = X_train.columns)
coefs

In [None]:
intercept = lr_model.intercept_
intercept

Ridge Regression

In [None]:
ridge_model = Ridge()
ridge_model

In [None]:
ridge_model.fit(X_train, y_train)
y_pred = ridge_model.predict(X_test)

In [None]:
ridge_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
ridge_rmse

In [None]:
ridge_cv_rmse =  np.sqrt(np.mean(-cross_val_score(ridge_model, X_train, y_train, cv = 10, scoring = "neg_mean_squared_error")))
ridge_cv_rmse

In [None]:
np.sqrt(-cross_val_score(ridge_model, X_train, y_train, cv = 10, scoring = "neg_mean_squared_error"))

In [None]:
pd.Series(ridge_model.coef_, index = X_train.columns)

Lasso Regression

In [None]:
lasso_model = Lasso()
lasso_model

In [None]:
lasso_model.fit(X_train,y_train)
y_pred = lasso_model.predict(X_test)

In [None]:
lasso_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
lasso_rmse

In [None]:
lasso_cv_rmse =  np.sqrt(np.mean(-cross_val_score(lasso_model, X_train, y_train, cv = 10, scoring = "neg_mean_squared_error")))
lasso_cv_rmse 

In [None]:
np.sqrt(-cross_val_score(lasso_model, X_train, y_train, cv = 10, scoring = "neg_mean_squared_error"))

In [None]:
pd.Series(lasso_model.coef_, index = X_train.columns)
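When reading the Lasso coefficients above, keep in mind that the L1 penalty can set coefficients exactly to zero, while Ridge only shrinks them. The sketch below illustrates this on synthetic data in which only the first two of ten features carry signal. The data and the `alpha` values are assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 10 features, only the first two influence the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = 5.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=1.0).fit(X_demo, y_demo)
lasso = Lasso(alpha=1.0).fit(X_demo, y_demo)

# Count coefficients that are exactly zero in each model.
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))

print("Ridge zero coefficients:", n_zero_ridge)
print("Lasso zero coefficients:", n_zero_lasso)
```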

ElasticNet Regression

In [None]:
elasticnet_model = ElasticNet()
elasticnet_model

In [None]:
elasticnet_model.fit(X_train, y_train)
y_pred = elasticnet_model.predict(X_test)

In [None]:
elasticnet_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
elasticnet_rmse

In [None]:
elasticnet_cv_rmse =  np.sqrt(np.mean(-cross_val_score(elasticnet_model, X_train, y_train, cv = 10, scoring = "neg_mean_squared_error")))
elasticnet_cv_rmse 

In [None]:
np.sqrt(-cross_val_score(elasticnet_model, X_train, y_train, cv = 10, scoring = "neg_mean_squared_error"))

In [None]:
pd.Series(elasticnet_model.coef_, index = X_train.columns)

Model Tuning

Ridge Regression Tuning

In [None]:
ridge_params = {'alpha' :10**np.linspace(10,-2,100)*0.5 ,
                'solver' : ['auto', 'svd', 'cholesky', 'lsqr']}
ridge_model = Ridge()
ridge_gridcv_model = GridSearchCV(estimator=ridge_model, param_grid=ridge_params, cv=10, n_jobs=-1, verbose=2).fit(X_train,y_train)
ridge_gridcv_model.best_params_
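The expression `10**np.linspace(10,-2,100)*0.5` used for the Ridge grid is compact but easy to misread. The short check below confirms that it is a log-spaced grid running from very strong to very weak regularization. The `np.logspace` form is just an equivalent spelling, not something the notebook uses.

```python
import numpy as np

# The grid from the Ridge search: the power is applied first, then the scaling by 0.5.
alphas = 10 ** np.linspace(10, -2, 100) * 0.5

# Equivalent spelling with logspace (exponents 10 down to -2, base 10).
alphas_logspace = 0.5 * np.logspace(10, -2, 100)

print(alphas[0], alphas[-1])
```

The largest value is 5e9 and the smallest is 0.005, so the search spans roughly twelve orders of magnitude in 100 steps.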

In [None]:
ridge_tuned_model = Ridge(**ridge_gridcv_model.best_params_)

In [None]:
ridge_tuned_model.fit(X_train, y_train)
y_pred = ridge_tuned_model.predict(X_test)

In [None]:
ridge_tuned_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
ridge_tuned_rmse

In [None]:
ridge_tuned_cv_rmse =  np.sqrt(np.mean(-cross_val_score(ridge_tuned_model, X_train, y_train, cv = 10, scoring = "neg_mean_squared_error")))
ridge_tuned_cv_rmse 

In [None]:
ridge_model = Ridge()

# Test RMSE for each alpha in the grid, from strongest to weakest regularization
for i in 10**np.linspace(10,-2,100)*0.5 :
    ridge_model.set_params(alpha = i)
    ridge_model.fit(X_train, y_train)
    y_pred = ridge_model.predict(X_test)
    print(np.sqrt(mean_squared_error(y_test, y_pred)))

Lasso Regression Tuning

In [None]:
lasso_params = {'alpha':np.linspace(0,1,1000)}

lasso_model = Lasso(tol = 0.001)
lasso_gridcv_model = GridSearchCV(estimator=lasso_model, param_grid = lasso_params, cv=10, n_jobs=-1, verbose=2).fit(X_train,y_train)
lasso_gridcv_model.best_params_

In [None]:
lasso_tuned_model = Lasso(**lasso_gridcv_model.best_params_)

In [None]:
lasso_tuned_model.fit(X_train, y_train)
y_pred = lasso_tuned_model.predict(X_test)

In [None]:
lasso_tuned_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
lasso_tuned_rmse

In [None]:
lasso_tuned_cv_rmse =  np.sqrt(np.mean(-cross_val_score(lasso_tuned_model, X_train, y_train, cv = 10, scoring = "neg_mean_squared_error")))
lasso_tuned_cv_rmse

ElasticNet Regression Tuning

In [None]:
elasticnet_params = {"l1_ratio": [0.1,0.4,0.5,0.6,0.8,1],
                     "alpha":[0.1,0.01,0.001,0.2,0.3,0.5,0.8,0.9,1],
                    }
elasticnet_model = ElasticNet()
elasticnet_gridcv_model = GridSearchCV(estimator=elasticnet_model, param_grid=elasticnet_params, cv=10, n_jobs=-1, verbose=2).fit(X_train,y_train)
elasticnet_gridcv_model.best_params_

In [None]:
elasticnet_tuned_model = ElasticNet(**elasticnet_gridcv_model.best_params_)

In [None]:
elasticnet_tuned_model.fit(X_train, y_train)
y_pred = elasticnet_tuned_model.predict(X_test)

In [None]:
elasticnet_tuned_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
elasticnet_tuned_rmse

In [None]:
elasticnet_tuned_cv_rmse =  np.sqrt(np.mean(-cross_val_score(elasticnet_tuned_model, X_train, y_train, cv = 10, scoring = "neg_mean_squared_error")))
elasticnet_tuned_cv_rmse 

Model Selection

In [None]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.metrics import mean_squared_error,r2_score
from sklearn.model_selection import GridSearchCV

def select_model(X,y):
   
    X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.20,random_state=46)
    
    models = [ 
        {
            "name": "RidgeRegression",
            "estimator": Ridge(),
            "hyperparameters":
                {
                 'alpha' :np.linspace(0,1,100) ,
                 'solver' : ['auto', 'svd', 'cholesky', 'lsqr']
                }
        },
        {
            "name": "LassoRegression",
            "estimator": Lasso(),
            "hyperparameters":
                {
                 'alpha' :np.linspace(0,1,100) ,
                }
        },
        {
            "name": "ElasticNetRegression",
            "estimator": ElasticNet(),
            "hyperparameters":
                {
                 "l1_ratio": np.linspace(0,1,30), # [0.1,0.4,0.5,0.6,0.8,1],
                 "alpha":np.linspace(0,1,100), # [0.1,0.01,0.001,0.2,0.3,0.5,0.8,0.9,1]
                }
        },
       
    ]

    for model in models:
        print(model['name'])
        print('-'*len(model['name']))

        grid = GridSearchCV(model["estimator"],
                            param_grid=model["hyperparameters"],
                            cv=10,scoring="neg_mean_squared_error")
        grid.fit(X_train, y_train)
        
        model["best_params"] = grid.best_params_
        #model["best_score"] = grid.best_score_
        model["tuned_model"] = grid.best_estimator_
        
        model["train_rmse_score"] = np.sqrt(mean_squared_error(y_train, model["tuned_model"].fit(X_train,y_train).predict(X_train)))
        model["validation_rmse_score"] = np.sqrt(np.mean(-cross_val_score(model["tuned_model"], X_train, y_train, cv = 10, scoring = "neg_mean_squared_error")))
        model["test_rmse_score"] = np.sqrt(mean_squared_error(y_test, model["tuned_model"].fit(X_train,y_train).predict(X_test)))
      
        #print("Best ......... Score: {}".format(model["best_score"]))
        print("Best TRAIN RMSE Score: {}".format(model["train_rmse_score"]))
        print("Best VALIDATION RMSE Score: {}".format(model["validation_rmse_score"]))
        print("Best TEST RMSE Score: {}".format(model["test_rmse_score"]))
        print("Best Parameters: {}\n".format(model["best_params"]))


In [None]:
select_model(X,y)