### ***Loading the data***
https://www.kaggle.com/datasets/mylesoneill/world-university-rankings

The data we selected describes a ranking of higher-education institutions. The goal of the analysis is to identify the best university in the world.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OneHotEncoder
import joblib


df = pd.read_csv("cwurData.csv", index_col=False)

df.head()

Unnamed: 0,world_rank,institution,country,national_rank,quality_of_education,alumni_employment,quality_of_faculty,publications,influence,citations,broad_impact,patents,score,year
0,1,Harvard University,USA,1,7,9,1,1,1,1,,5,100.0,2012
1,2,Massachusetts Institute of Technology,USA,2,9,17,3,12,4,4,,1,91.67,2012
2,3,Stanford University,USA,3,17,11,5,4,2,2,,15,89.5,2012
3,4,University of Cambridge,United Kingdom,1,10,24,4,16,16,11,,50,86.17,2012
4,5,California Institute of Technology,USA,4,2,29,7,37,22,22,,18,85.21,2012


### ***Variable description***

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2200 entries, 0 to 2199
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   world_rank            2200 non-null   int64  
 1   institution           2200 non-null   object 
 2   country               2200 non-null   object 
 3   national_rank         2200 non-null   int64  
 4   quality_of_education  2200 non-null   int64  
 5   alumni_employment     2200 non-null   int64  
 6   quality_of_faculty    2200 non-null   int64  
 7   publications          2200 non-null   int64  
 8   influence             2200 non-null   int64  
 9   citations             2200 non-null   int64  
 10  broad_impact          2000 non-null   float64
 11  patents               2200 non-null   int64  
 12  score                 2200 non-null   float64
 13  year                  2200 non-null   int64  
dtypes: float64(2), int64(10), object(2)
memory usage: 240.8+ KB


###*Splitting the variables into categorical / numerical*

In [3]:
categorical = [var for var in df.columns if df[var].dtype == 'O']
numerical = [var for var in df.columns if df[var].dtype != 'O']

df[categorical].isnull().sum()

institution    0
country        0
dtype: int64

In [4]:
df[numerical].isnull().sum()

world_rank                0
national_rank             0
quality_of_education      0
alumni_employment         0
quality_of_faculty        0
publications              0
influence                 0
citations                 0
broad_impact            200
patents                   0
score                     0
year                      0
dtype: int64
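The dtype-based split above can also be written with pandas' `select_dtypes`. The sketch below uses a small toy frame with made-up values (the institution names are hypothetical), not the real data set, and treats every non-numeric column as categorical.

```python
import pandas as pd

# Toy frame with hypothetical values that mimics a few columns of the data set
toy = pd.DataFrame({
    "institution": ["A University", "B Institute"],
    "country": ["USA", "Poland"],
    "world_rank": [1, 2],
    "score": [100.0, 91.67],
})

# Numeric columns are selected by dtype family; everything else is categorical
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Missing-value counts per group, as in the cells above
print(toy[categorical_cols].isnull().sum())
print(toy[numerical_cols].isnull().sum())
```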

###*Removing the records that have a missing value in the 'broad_impact' column*

In [5]:
df = df.dropna(axis=0, subset=['broad_impact'])
df[numerical].isnull().sum()

world_rank              0
national_rank           0
quality_of_education    0
alumni_employment       0
quality_of_faculty      0
publications            0
influence               0
citations               0
broad_impact            0
patents                 0
score                   0
year                    0
dtype: int64

###*The data set is now cleaned of the unwanted records*

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2000 entries, 200 to 2199
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   world_rank            2000 non-null   int64  
 1   institution           2000 non-null   object 
 2   country               2000 non-null   object 
 3   national_rank         2000 non-null   int64  
 4   quality_of_education  2000 non-null   int64  
 5   alumni_employment     2000 non-null   int64  
 6   quality_of_faculty    2000 non-null   int64  
 7   publications          2000 non-null   int64  
 8   influence             2000 non-null   int64  
 9   citations             2000 non-null   int64  
 10  broad_impact          2000 non-null   float64
 11  patents               2000 non-null   int64  
 12  score                 2000 non-null   float64
 13  year                  2000 non-null   int64  
dtypes: float64(2), int64(10), object(2)
memory usage: 234.4+ KB
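The index above runs from 200 to 2199 because `dropna` keeps the original row labels. If a contiguous index starting at 0 is preferred, `reset_index(drop=True)` renumbers the rows. A minimal sketch on a toy frame with made-up values:

```python
import numpy as np
import pandas as pd

# Toy frame: the first two rows have no broad_impact value
toy = pd.DataFrame({
    "world_rank": [1, 2, 3, 4],
    "broad_impact": [np.nan, np.nan, 5.0, 7.0],
})

# dropna removes the rows but keeps the original labels (2 and 3 here)
cleaned = toy.dropna(axis=0, subset=["broad_impact"])

# reset_index(drop=True) renumbers the remaining rows from 0
renumbered = cleaned.reset_index(drop=True)

print(list(cleaned.index), list(renumbered.index))
```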


###*Splitting into training, validation, and test sets*

In [7]:
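# Target: world_rank; all remaining columns are used as features
X = df.drop(['world_rank'], axis=1)
y = df['world_rank']
# Hold out 30% of the rows, then split that hold-out 70/30 into validation and test
# (about 70% / 21% / 9% of the data overall)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.3, random_state=42)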
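To make the resulting proportions concrete, the sketch below repeats the same two `train_test_split` calls on a plain array of 2000 row indices (the row count reported by `df.info()` after cleaning) and prints the size of each part. The index array is only a stand-in for the real frame.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 2000  # rows left after dropping the missing broad_impact values
indices = np.arange(n_rows)  # stand-in for the real data frame

# Same parameters as in the cell above
train_idx, temp_idx = train_test_split(indices, test_size=0.3, random_state=42)
val_idx, test_idx = train_test_split(temp_idx, test_size=0.3, random_state=42)

sizes = (len(train_idx), len(val_idx), len(test_idx))
print(sizes)
```

Because the second call applies `test_size=0.3` to the 30% hold-out, the test set ends up smaller than the validation set; passing `test_size=0.5` in the second call would split the hold-out evenly.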

###*Selecting the text columns to be encoded*


In [8]:
categorical_cols = [var for var in X.columns if X[var].dtype == 'O']

###*Initializing the one-hot encoder*

In [9]:
# scikit-learn >= 1.2 names this option sparse_output; on older versions use sparse=False
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

###*Fitting the encoder and transforming the text data with one-hot encoding*


In [10]:
X_train_encoded = encoder.fit_transform(X_train[categorical_cols])
X_val_encoded = encoder.transform(X_val[categorical_cols])
X_test_encoded = encoder.transform(X_test[categorical_cols])
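The encoder is fitted on the training set only, so the validation and test sets may contain institutions or countries it has never seen. With `handle_unknown='ignore'` such a value is encoded as a row of zeros instead of raising an error. A minimal sketch on toy data with made-up country values (the encoder's default sparse output is converted with `.toarray()` so that the example does not depend on the scikit-learn version):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data: "Japan" appears only in the new data, not in the training data
train = pd.DataFrame({"country": ["USA", "Poland", "USA"]})
new = pd.DataFrame({"country": ["Poland", "Japan"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# Columns follow the sorted training categories: Poland, USA
encoded = encoder.transform(new).toarray()
print(encoded)
```

The second row contains only zeros, because "Japan" has no column of its own.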

###*Dropping the text columns from the input data*

In [11]:
X_train_numeric = X_train.drop(categorical_cols, axis=1)
X_val_numeric = X_val.drop(categorical_cols, axis=1)
X_test_numeric = X_test.drop(categorical_cols, axis=1)

###*Combining the one-hot-encoded data with the numerical data*

In [12]:
X_train_final = np.hstack((X_train_numeric, X_train_encoded))
X_val_final = np.hstack((X_val_numeric, X_val_encoded))
X_test_final = np.hstack((X_test_numeric, X_test_encoded))

###*Initializing the random forest regression model (Random Forest Regressor)*

In [13]:
model = RandomForestRegressor(random_state=42)

###*Training the model on the training data*

In [14]:
model.fit(X_train_final, y_train)
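The encode, drop, and stack steps can also be bundled with the model into a single scikit-learn `Pipeline`. The sketch below is one possible arrangement, not the method used in this notebook; it runs on a toy frame with made-up values, and `n_estimators=10` is chosen only to keep the example fast.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy features and target with hypothetical values
toy_X = pd.DataFrame({
    "country": ["USA", "USA", "Poland", "Japan"],
    "score": [100.0, 91.7, 45.2, 60.3],
})
toy_y = pd.Series([1, 2, 400, 150])

# One-hot encode the text column, pass the numeric column through unchanged
preprocess = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), ["country"])],
    remainder="passthrough",
)

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestRegressor(n_estimators=10, random_state=42)),
])

pipeline.fit(toy_X, toy_y)
predictions = pipeline.predict(toy_X)
print(predictions)
```

With this layout a single `fit` and `predict` call covers both preprocessing and the model, and the fitted encoder travels together with the forest.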

###*Predicting on the validation data*

In [15]:
predictions_val = model.predict(X_val_final)


###*Computing the mean squared error (MSE) as a measure of model quality on the validation data*

In [16]:
mse_val = mean_squared_error(y_val, predictions_val)
print("Mean Squared Error on Validation Data:", mse_val)

Mean Squared Error on Validation Data: 219.64705023809523
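For reference, the MSE is the average of the squared differences between the true and the predicted values. The sketch below computes it by hand with NumPy on three made-up rank values:

```python
import numpy as np

# Hypothetical true and predicted ranks
y_true = np.array([10, 20, 30])
y_pred = np.array([12, 18, 33])

# Errors are -2, 2, -3; squared they are 4, 4, 9; the mean is 17 / 3
mse = np.mean((y_true - y_pred) ** 2)
print(mse)
```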


###*Predicting on the test data*

In [17]:
predictions_test = model.predict(X_test_final)

###*Computing the mean squared error (MSE) as a measure of model quality on the test data*

In [18]:
mse_test = mean_squared_error(y_test, predictions_test)
print("Mean Squared Error on Test Data:", mse_test)

Mean Squared Error on Test Data: 274.2114888888888
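The MSE is expressed in squared rank positions, which is hard to read directly. Taking the square root (RMSE) brings the error back to the scale of `world_rank`. The sketch below applies this to the two values printed above:

```python
import math

# MSE values copied from the notebook output above
mse_val = 219.64705023809523
mse_test = 274.2114888888888

# RMSE is on the same scale as the target (rank positions)
rmse_val = math.sqrt(mse_val)
rmse_test = math.sqrt(mse_test)

print(f"RMSE validation: {rmse_val:.2f}, RMSE test: {rmse_test:.2f}")
```

This corresponds to a typical error of roughly 15 rank positions on the validation set and roughly 17 on the test set.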


###*Saving the model to a file*

In [19]:
joblib.dump(model, 'random_forest_model.pkl')
print("Model saved as 'random_forest_model.pkl'")

Model saved as 'random_forest_model.pkl'
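A saved model can be read back with `joblib.load`. The sketch below shows the round trip on a small toy model written to a temporary directory (the file name `toy_forest.pkl` and the random data are made up for the example). Note that new data must go through the same fitted encoder before prediction, so in practice it may be worth saving the encoder alongside the model.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy regression data, only used to have a fitted model to save
rng = np.random.default_rng(42)
X_toy = rng.random((20, 3))
y_toy = X_toy.sum(axis=1)

model = RandomForestRegressor(n_estimators=10, random_state=42).fit(X_toy, y_toy)

# Save to a temporary file and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_forest.pkl")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should give the same predictions as the original
same_predictions = np.allclose(model.predict(X_toy), restored.predict(X_toy))
print(same_predictions)
```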
