Your task is to build a machine learning model that will help the company identify people who are more likely to donate and then try to predict the donation amount.

- Import the required libraries and modules that you would need.

- Read that data into Python and call the dataframe donors.

- Check the datatypes of all the columns in the data.

- Check for null values in the dataframe. Replace the null values using the methods learned in class.

- Split the data into numerical and categorical. Decide if any columns need their dtype changed.


    - Split the data into a training set and a test set.
    - Scale the features either by using normalizer or a standard scaler.
    - Encode the categorical features using One-Hot Encoding or Ordinal Encoding.
    - Fit a logistic regression model on the training data.
    - Check the accuracy on the test data.
    
Note: So far we have not balanced the data.

Managing imbalance in the dataset

- Check for the imbalance.
- Use the resampling strategies used in class for upsampling and downsampling to create a balance between the two classes.
- Each time fit the model and see how the accuracy of the model has changed.
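
The steps above can be sketched on a small made-up dataframe rather than the donors data. The column names `feature` and `TARGET_B` and the 95/5 class split are assumptions for illustration only; this is one possible way to check the balance and resample with `sklearn.utils.resample`, not necessarily the exact method used in class.

```python
import pandas as pd
from sklearn.utils import resample

# Toy data (hypothetical): 95 rows of class 0 and 5 rows of class 1
data = pd.DataFrame({
    "feature": range(100),
    "TARGET_B": [0] * 95 + [1] * 5,
})

# Check for imbalance: share of each class
class_share = data["TARGET_B"].value_counts(normalize=True)
print(class_share)

majority = data[data["TARGET_B"] == 0]
minority = data[data["TARGET_B"] == 1]

# Downsampling: draw as many majority rows as there are minority rows
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=0)
downsampled = pd.concat([majority_down, minority])

# Upsampling: draw minority rows with replacement up to the majority size
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=0)
upsampled = pd.concat([majority, minority_up])

print(downsampled["TARGET_B"].value_counts())
print(upsampled["TARGET_B"].value_counts())
```

In practice the resampling would be applied to the training set only, so that the test set keeps the original class proportions.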

SOLICITATION LIMIT CODE IN HOUSE 
      
(blank)=can be mailed (Default)    
00=Do Not Solicit    
01=one solicitation per year    
02=two solicitations per year     
03=three solicitations per year    
04=four solicitations per year    
05=five solicitations per year    
06=six solicitations per year    
12=twelve solicitations per year

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm

In [2]:
numerical = pd.read_csv('/Users/szabonikolett/Desktop/Ironhack-Labs/numerical7_02.csv')
numerical = numerical.drop(['Unnamed: 0'], axis=1) # to get rid of the unnamed useless column
numerical

numerical.shape

(95412, 322)

In [3]:
categorical = pd.read_csv('/Users/szabonikolett/Desktop/Ironhack-Labs/categorical7_02.csv')
categorical = categorical.drop(['Unnamed: 0'],axis=1)
categorical

categorical.shape

(95412, 12)

In [4]:
target = pd.read_csv('/Users/szabonikolett/Desktop/Ironhack-Labs/target7_02.csv')
target = target.drop(['Unnamed: 0'],axis=1)
target

target.shape

(95412, 2)

In [5]:
categorical.isna().sum()

STATE           0
CLUSTER         0
HOMEOWNR        0
GENDER          0
DATASRCE        0
SOLIH       89212
VETERANS    84986
RFA_2R          0
RFA_2A          0
GEOCODE2        0
DOMAIN_A        0
DOMAIN_B        0
dtype: int64

In [6]:
categorical['SOLIH'].value_counts()

12.0    5693
0.0      296
1.0       94
2.0       75
3.0       19
4.0       16
6.0        7
Name: SOLIH, dtype: int64

In [7]:
categorical['SOLIH'] = categorical['SOLIH'].fillna(20) # 20 is a placeholder code for missing values; use the number, not the string '20', because mixing strings and floats in one column causes problems with the encoder

In [8]:
categorical['VETERANS'].value_counts()

Y    10426
Name: VETERANS, dtype: int64

In [9]:
categorical['VETERANS'] = categorical['VETERANS'].fillna('N')

In [10]:
categorical.isna().sum()

STATE       0
CLUSTER     0
HOMEOWNR    0
GENDER      0
DATASRCE    0
SOLIH       0
VETERANS    0
RFA_2R      0
RFA_2A      0
GEOCODE2    0
DOMAIN_A    0
DOMAIN_B    0
dtype: int64

In [11]:
target.dtypes

TARGET_B      int64
TARGET_D    float64
dtype: object

In [12]:
numerical['AGE'].value_counts()#(dropna=False)

61.611649    23665
50.000000     1930
76.000000     1885
72.000000     1813
68.000000     1809
             ...  
9.000000         1
6.000000         1
10.000000        1
8.000000         1
15.000000        1
Name: AGE, Length: 97, dtype: int64

In [13]:
# the mode might not be the best choice here
numerical['AGE'] = numerical['AGE'].fillna(numerical['AGE'].mode()[0]) # replace the only NaN with the mode (61.611649)

In [14]:
# use target b and drop target d

### Split the data into a training set and a test set.

In [15]:
from sklearn.model_selection import train_test_split

In [16]:
X = pd.concat([numerical,categorical],axis=1)
y = target.drop(['TARGET_D'],axis=1)
# y = target['TARGET_B']

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 1)

In [18]:
X_train_num = X_train.select_dtypes(np.number)
X_train_cat = X_train.select_dtypes(object)

X_test_num =  X_test.select_dtypes(np.number)
X_test_cat = X_test.select_dtypes(object)

### Scale the features either by using normalizer or a standard scaler.

In [19]:
#from sklearn.preprocessing import MinMaxScaler

#transformer = MinMaxScaler().fit(X_train_num) 
#X_train_num = transformer.transform(X_train_num) # saving result in variable, I only train here
#X_train_num = pd.DataFrame(X_train_num)

In [20]:
#from sklearn.preprocessing import OneHotEncoder # run only one time

#encoder = OneHotEncoder(drop='first', handle_unknown = 'ignore').fit(X_train_cat) # or handle_unknown=‘error’
#X_train_cat = encoder.transform(X_train_cat).toarray() # TRANSFORMING into an array; 2 dimensional array

#X_test_cat = encoder.transform(X_test_cat).toarray() # encode the test set with the encoder fitted on the training set

In [21]:
#X_train_scaled = np.concatenate((X_train_num, X_train_cat),axis=1)
#X_train_scaled = pd.DataFrame(X_train_scaled)
#X_train_scaled.head()

In [22]:
#X_test_scaled = np.concatenate((X_test_num, X_test_cat),axis=1)
#X_test_scaled = pd.DataFrame(X_test_scaled)
#X_test_scaled.head()


In [23]:
from sklearn.preprocessing import MinMaxScaler

transformer = MinMaxScaler().fit(X_train_num)
cols=transformer.get_feature_names_out(input_features=X_train_num.columns)

X_train_numscale = transformer.transform(X_train_num)
X_test_numscale = transformer.transform(X_test_num)

X_train_num = pd.DataFrame(X_train_numscale, columns=X_train_num.columns)
X_test_num = pd.DataFrame(X_test_numscale, columns=X_test_num.columns)

In [24]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown='error', drop='first').fit(X_train_cat)
cols=encoder.get_feature_names_out(input_features=X_train_cat.columns)

X_train_cat = encoder.transform(X_train_cat).toarray()
X_test_cat = encoder.transform(X_test_cat).toarray()


X_train_cat = pd.DataFrame(X_train_cat,columns=cols)
X_test_cat = pd.DataFrame(X_test_cat,columns=cols)

In [25]:
X_train_scaled = pd.concat([X_train_num, X_train_cat],axis=1)
X_test_scaled = pd.concat([X_test_num, X_test_cat],axis=1)

###  Fit a logistic regression model on the training data.

In [29]:
%%time
from sklearn.linear_model import LogisticRegression
# np.ravel turns the single-column y_train dataframe into the 1-D array that sklearn expects
classification = LogisticRegression(random_state=0, solver='saga',
                  multi_class='multinomial').fit(X_train_scaled, np.ravel(y_train))


CPU times: user 54.2 s, sys: 945 ms, total: 55.2 s
Wall time: 57.7 s




### Check the accuracy on the test data.

In [30]:
%%time
predictions = classification.predict(X_test_scaled)
classification.score(X_test_scaled, y_test)

CPU times: user 138 ms, sys: 76.7 ms, total: 214 ms
Wall time: 232 ms


0.9496499392110007

In [None]:
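# accuracy is about 95% on the test set, but the classes have not been balanced yet,
# so a model that always predicts the majority class could score similarly;
# compare with y_test.value_counts(normalize=True) before trusting this number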
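
To see why accuracy alone can mislead on imbalanced data, here is a small sketch with made-up labels (95 zeros and 5 ones, an assumed split, not the real donors data). It uses scikit-learn's `DummyClassifier` as a baseline that always predicts the most frequent class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Toy labels (hypothetical): 95 rows of class 0 and 5 rows of class 1
y_true = np.array([0] * 95 + [1] * 5)
X_dummy = np.zeros((len(y_true), 1))  # features are ignored by this baseline

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_pred = baseline.predict(X_dummy)

baseline_accuracy = accuracy_score(y_true, y_pred)
baseline_recall = recall_score(y_true, y_pred)  # recall for class 1
baseline_confusion = confusion_matrix(y_true, y_pred)

print("accuracy:", baseline_accuracy)
print("recall for class 1:", baseline_recall)
print(baseline_confusion)
```

The baseline reaches high accuracy without finding a single class 1 row, which is why recall and the confusion matrix are worth checking alongside `classification.score`.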

In [None]:
### Dealing with imbalanced data

In [None]:
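from sklearn.utils import resample

# recombine the training features and target so rows stay aligned when resampling
train_data = X_train_scaled.copy()
train_data['TARGET_B'] = np.ravel(y_train)
category_0 = train_data[train_data['TARGET_B'] == 0] # rows with TARGET_B == 0
category_1 = train_data[train_data['TARGET_B'] == 1] # rows with TARGET_B == 1

category_0_undersampled = resample(category_0, 
                                   replace=False, 
                                   n_samples = len(category_1),
                                   random_state = 0)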
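
A minimal sketch of the next step, refitting the model on undersampled data and comparing it with the original fit. It runs on synthetic data from `make_classification` (an assumption, roughly 95% class 0), and the names `fit_and_score`, `train_df` and `balanced` are hypothetical helpers introduced only for this example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data (hypothetical), roughly 95% class 0 and 5% class 1
X_syn, y_syn = make_classification(n_samples=2000, n_features=5,
                                   weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25,
                                          stratify=y_syn, random_state=0)

# Undersample the majority class in the training set only
train_df = pd.DataFrame(X_tr)
train_df["target"] = y_tr
minority = train_df[train_df["target"] == 1]
majority = resample(train_df[train_df["target"] == 0], replace=False,
                    n_samples=len(minority), random_state=0)
balanced = pd.concat([majority, minority])


def fit_and_score(X, y):
    """Fit a logistic regression; return (accuracy, recall) on the untouched test set."""
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    return clf.score(X_te, y_te), recall_score(y_te, clf.predict(X_te))


scores_original = fit_and_score(X_tr, y_tr)
scores_balanced = fit_and_score(balanced.drop(columns="target").to_numpy(),
                                balanced["target"].to_numpy())

print("original (accuracy, recall):", scores_original)
print("undersampled (accuracy, recall):", scores_balanced)
```

After balancing, accuracy often drops while recall for the minority class rises, though this is not guaranteed and depends on the data.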

In [28]:
#categorical['STATE'].value_counts()

In [31]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
%%time
max_depth = range(1,30)
test = []
train = []

for depth in max_depth:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train_scaled, y_train)
    test.append(model.score(X_test_scaled,y_test))
    train.append(model.score(X_train_scaled,y_train))

In [None]:
%%time
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(max_depth, train, label="training accuracy")
plt.plot(max_depth, test, label="test accuracy")
plt.ylabel("Accuracy") # fraction of correct predictions, between 0 and 1
plt.xlabel("max_depth")
plt.legend()

In [None]:
model.feature_importances_

In [None]:
%%time
def plot_feature_importances(model, feature_names, top_n=20):
    # plot only the top_n most important features; the full set is too large to read
    importances = pd.Series(model.feature_importances_, index=feature_names)
    top = importances.sort_values().tail(top_n)
    plt.barh(top.index, top.values, align='center')
    plt.xlabel("Feature importance")
    plt.ylabel("Feature")

plot_feature_importances(model, X_train_scaled.columns)

### Experimenting with Decision Tree Regressor

In [None]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
%%time
regr = DecisionTreeRegressor(max_depth=4)

model = regr.fit(X_train_scaled, y_train)

In [None]:
print("test data R2 score was: ",regr.score(X_test_scaled, y_test)) # R2 score, on the same scaled and encoded features used for fitting
print("train data R2 score was: ",regr.score(X_train_scaled, y_train))

In [None]:
list(X_train_scaled.columns)

In [None]:
from sklearn.tree import export_text

# regr was built in the DecisionTreeRegressor cell above

r = export_text(regr, feature_names=list(X_train_scaled.columns))
print(r)