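# **Predicting the Future Closure Probability of Taiwan's Existing Colleges and Universities**

> **TAICA Introduction to Artificial Intelligence: Final Project**

* National Changhua University of Education, Department of Computer Science and Information Engineering, S1254059 吳佳泰
* National Changhua University of Education, Department of Computer Science and Information Engineering, S1454007 陳佳君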

---

### **Usage**
* Before running in Google Colab, make [university_data.csv](https://drive.google.com/file/d/13HyOXm4acnnKeNu2lM3rexNBt0cvwlo1/view?usp=sharing) and [university_data_test.csv](https://drive.google.com/file/d/1mWUKdGfs2Nkdl5hOlrkLQ88EVoSwKL_3/view?usp=sharing) available in the runtime's working directory.
* When the run finishes, the notebook writes **113_data_prediction_results.csv**.

## Step1. Install the required packages

In [None]:
!pip install pandas
!pip install scikit-learn
!pip install numpy



## Step2. Import the required modules and data


In [1]:
import pandas
import numpy

from sklearn import metrics
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

university_data = pandas.read_csv("university_data.csv")
university_data_test = pandas.read_csv("university_data_test.csv")

# keep school_name so it can be attached to the prediction output later
test_school_names = university_data_test["school_name"].copy()

## Step3. Data preprocessing

* Drop unused columns
* Label-encode the school type

In [2]:
# all columns : id,data_year,school_name,private_flag,school_type,school_region,urbanization_level,enrollment_quota,new_student_count,stability,enrollment_rate,tuition_revenue_ratio,debt_ratio,net_income_ratio,totur_flag,closure_flag
# need columns: private_flag,school_type,urbanization_level,enrollment_quota,new_student_count,stability,enrollment_rate,tuition_revenue_ratio,debt_ratio,net_income_ratio,totur_flag,closure_flag

# drop columns that are not used as features
university_data = university_data.drop(columns=["id", "data_year", "school_name", "school_region"])
university_data_test = university_data_test.drop(columns=["id", "data_year", "school_name", "school_region"])

# encode school_type
encode = LabelEncoder()
university_data["school_type"] = encode.fit_transform(university_data["school_type"])
university_data_test["school_type"] = encode.transform(university_data_test["school_type"])
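
`LabelEncoder.transform` raises a `ValueError` when the test column contains a category that was not present during `fit`. The following is a minimal sketch of one way to check for that before encoding. The toy category names and the policy of dropping unseen rows are assumptions for illustration, not part of the original notebook.

```python
import pandas
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy values standing in for the school_type column.
train_types = pandas.Series(["university", "college", "junior_college", "college"])
test_types = pandas.Series(["college", "university", "academy"])  # "academy" is unseen

encoder = LabelEncoder()
train_encoded = encoder.fit_transform(train_types)

# Find categories that appear only in the test data.
unseen = sorted(set(test_types) - set(encoder.classes_))
print("Unseen categories:", unseen)

# One possible policy (an assumption): encode only rows with known categories.
known_mask = test_types.isin(encoder.classes_)
test_encoded = encoder.transform(test_types[known_mask])
print("Encoded known rows:", list(test_encoded))
```

If unseen categories do show up, another option is to fit the encoder on the union of both columns, at the cost of letting test-set information into preprocessing.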

## Step4. Separate features and labels, then split into training and test sets

In [3]:
X = university_data.drop("closure_flag", axis=1).values  # feature
y = university_data["closure_flag"].values  # label

test_X = university_data_test.drop("closure_flag", axis=1).values # feature
test_y = university_data_test["closure_flag"].values  # label

# preliminary split on the full data; Step5 overwrites these variables after undersampling
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

## Step5. Undersampling

In [4]:
# rebuild a DataFrame from the feature matrix and the labels
feature_columns = ["private_flag", "school_type", "urbanization_level", "enrollment_quota",
                   "new_student_count", "stability", "enrollment_rate", "tuition_revenue_ratio",
                   "debt_ratio", "net_income_ratio", "totur_flag"]
df = pandas.DataFrame(X, columns=feature_columns)
df['closure_flag'] = y

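# separate closure and non-closure rows
df_closure = df[df['closure_flag'] == 1]
df_normal = df[df['closure_flag'] == 0]

# undersampling: keep three non-closure schools for every closed school
df_normal_sampled = df_normal.sample(n=len(df_closure)*3, random_state=42)

# combine and shuffle (the result has a 3:1 class ratio, not a 1:1 balance)
df_balanced = pandas.concat([df_closure, df_normal_sampled]).sample(frac=1, random_state=42)

# separate features and labels again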
X_balanced = df_balanced.drop('closure_flag', axis=1).values
y_balanced = df_balanced['closure_flag'].values

X_train, X_test, y_train, y_test = train_test_split(X_balanced, y_balanced, test_size=0.2, stratify=y_balanced, random_state=0)
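
The cell above undersamples first and splits afterwards, so the held-out portion also has the 3:1 ratio rather than the original class distribution. A common alternative is to split first and undersample only the training portion. The sketch below shows that ordering on synthetic data. The column names `feature_a` and `feature_b`, the row counts, and the variable names are all hypothetical.

```python
import numpy
import pandas
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 rows, 10% positive class (not the project data).
rng = numpy.random.default_rng(0)
demo = pandas.DataFrame({
    "feature_a": rng.normal(size=200),
    "feature_b": rng.normal(size=200),
    "closure_flag": [1] * 20 + [0] * 180,
})

# 1. Split first so the held-out set keeps the original class ratio.
train_df, holdout_df = train_test_split(
    demo, test_size=0.2, stratify=demo["closure_flag"], random_state=0
)

# 2. Undersample the majority class in the training portion only (3:1, as above).
train_closure = train_df[train_df["closure_flag"] == 1]
train_normal = train_df[train_df["closure_flag"] == 0]
train_normal_sampled = train_normal.sample(n=len(train_closure) * 3, random_state=42)
train_balanced = pandas.concat([train_closure, train_normal_sampled]).sample(
    frac=1, random_state=42
)

print("Training class counts:", train_balanced["closure_flag"].value_counts().to_dict())
print("Held-out class counts:", holdout_df["closure_flag"].value_counts().to_dict())
```

With this ordering, metrics computed on the held-out rows reflect the class imbalance the model would face on unseen data.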

## Step6. Build the model

In [5]:
# no random_state is set here, so the fitted forest (and its scores) can vary between runs
random_forest_model = RandomForestClassifier(n_estimators=200, class_weight="balanced")
random_forest_model.fit(X_train, y_train)
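
Because the undersampled data set is small, a single train/test split can give a noisy estimate. Stratified k-fold cross-validation is one way to get a steadier picture. This is a minimal sketch on synthetic data generated with `make_classification`. The sample size, the fold count, and the choice of F1 as the metric are assumptions, and `X_demo`/`y_demo` merely stand in for `X_balanced`/`y_balanced`.

```python
import numpy
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data (hypothetical stand-in for X_balanced / y_balanced).
X_demo, y_demo = make_classification(
    n_samples=80, n_features=11, n_informative=5, weights=[0.75, 0.25], random_state=0
)

# Same hyperparameters as the notebook, plus a fixed seed for repeatability.
model = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)

# Stratified folds keep the class ratio roughly constant in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="f1")

print("F1 per fold:", numpy.round(scores, 3))
print("Mean F1:", scores.mean())
```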

## Step7. Predict and evaluate the model
* Accuracy: `random_forest_model.score()`
* Classification report (precision, recall, F1): `metrics.classification_report()`
* Feature importance: `random_forest_model.feature_importances_`

In [6]:
# accuracy
accuracy = random_forest_model.score(X_test, y_test)
print("Accuracy", accuracy)

print("\n---------------------------------------------------\n")

# classification report (precision, recall, F1 per class)
predict = random_forest_model.predict(X_test)
print(metrics.classification_report(y_test, predict))

print("\n---------------------------------------------------\n")

# feature importance
importances = random_forest_model.feature_importances_
features = university_data.drop("closure_flag", axis=1).columns
for f, imp in zip(features, importances):
    print(f"{f}: {imp:.4f}")

Accuracy 0.8571428571428571

---------------------------------------------------

              precision    recall  f1-score   support

           0       0.91      0.91      0.91        11
           1       0.67      0.67      0.67         3

    accuracy                           0.86        14
   macro avg       0.79      0.79      0.79        14
weighted avg       0.86      0.86      0.86        14


---------------------------------------------------

private_flag: 0.0059
school_type: 0.0055
urbanization_level: 0.0545
enrollment_quota: 0.2045
new_student_count: 0.3177
stability: 0.0987
enrollment_rate: 0.1899
tuition_revenue_ratio: 0.0391
debt_ratio: 0.0393
net_income_ratio: 0.0344
totur_flag: 0.0105
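
The classification report gives per-class ratios but not the raw counts. `metrics.confusion_matrix` shows them directly. The sketch below uses hypothetical label arrays arranged to reproduce the counts implied by the report above (11 non-closure samples with 10 predicted correctly, 3 closure samples with 2 predicted correctly). In the notebook itself the call would take `y_test` and `predict` instead.

```python
import numpy
from sklearn import metrics

# Hypothetical labels consistent with the report: support 11 and 3, accuracy 12/14.
y_true = numpy.array([0] * 11 + [1] * 3)
y_pred = numpy.array([0] * 10 + [1] + [1, 1, 0])

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = metrics.confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(cm)
print("TN, FP, FN, TP:", tn, fp, fn, tp)
```

With only three closure samples in the test split, a single changed prediction moves the closure recall by a third, so these figures should be read with caution.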


## Step8. Predict on new data and export the results

In [7]:
test_pred = random_forest_model.predict(test_X)
test_pred_proba = random_forest_model.predict_proba(test_X)[:, 1]

result = pandas.DataFrame({"School_Name": test_school_names, "True_Label": test_y, "Predicted_Label": test_pred, "Closure_Probability": test_pred_proba})

# result explanation
result["Prediction_Result"] = result.apply(lambda row: "correct" if row["True_Label"] == row["Predicted_Label"] else "error", axis=1)
result["Risk_Level"] = result["Closure_Probability"].apply(lambda x: "high" if x >= 0.7 else ("medium" if x >= 0.4 else "low"))
result = result.sort_values('Closure_Probability', ascending=False)

# output file
print(result)
result.to_csv("113_data_prediction_results.csv", index=False, encoding="utf-8-sig")
print("All data is saved. You can check it in 113_data_prediction_results.csv.")

print("\n---------------------------------------------------\n")

# summary counts
print("Closure = 1 in 113 data:", numpy.sum(test_y == 1))
print("Closure = 1 in our model:", numpy.sum(test_pred == 1))
print("Correctly predicted closure schools:", numpy.sum((test_y == 1) & (test_pred == 1)))

    School_Name  True_Label  Predicted_Label  Closure_Probability  \
67         玄奘大學           0                1                0.920   
127  聖母醫護管理專科學校           0                1                0.825   
128     聖約翰科技大學           0                1                0.765   
113  慈惠醫護管理專科學校           0                1                0.760   
68         真理大學           0                1                0.750   
..          ...         ...              ...                  ...   
30     國立高雄師範大學           0                0                0.000   
70       臺北醫學大學           0                0                0.000   
27     國立陽明交通大學           0                0                0.000   
24     國立臺灣藝術大學           0                0                0.000   
0        國立中央大學           0                0                0.000   

    Prediction_Result Risk_Level  
67              error       high  
127             error       high  
128             error       high  
113             error       hig