Given the data set, do a quick exploratory data analysis to get a feel for the distributions and biases of the data.  Report any visualizations and findings used and suggest any other impactful business use cases for that data:

The distribution of academic year is heavily imbalanced: (2719 + 2273)/(2719 + 2273 + 5 + 3) = 0.9984, so 99.84% of the people in the data set are in year 2 or year 3. In addition, the vast majority of the data came from three schools: Butler University (1614/5000), Indiana State University (1309/5000), and Ball State University (1085/5000). The majors show the same pattern: none of the bottom 9 majors (such as Psychology and Civil Engineering) has more than 76 people. This over-representation of year 2 and year 3 students, and of those specific universities and majors, biases the data set. It primarily reflects the characteristics of individuals from those academic years and institutions, and it will not represent a year 1 student, for example, as well. Given this data, business use cases should focus on the better-represented universities, like Butler, since more data is available for them. If other universities are added, factors to consider include the proportion of their students who are in year 2 or year 3 and whether their majors are among the popular majors in the data set, like Chemistry.
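
The shares quoted above can be recomputed from the printed counts. The following is a minimal sketch that hard-codes the counts from the notebook output instead of reading the CSV; the variable names are illustrative and are not part of the original notebook.

```python
# Counts copied from the notebook's printed distributions (no CSV needed).
year_counts = {"Year 3": 2719, "Year 2": 2273, "Year 1": 5, "Year 4": 3}
top3_university_counts = {
    "Butler University": 1614,
    "Indiana State University": 1309,
    "Ball State University": 1085,
}

total_people = sum(year_counts.values())  # 5000 rows in the data set

# Share of people who are in year 2 or year 3.
year_2_3_share = (year_counts["Year 2"] + year_counts["Year 3"]) / total_people

# Share of people who come from the three largest universities.
top3_university_share = sum(top3_university_counts.values()) / total_people

print(f"Year 2 and 3 share: {year_2_3_share:.2%}")
print(f"Top 3 universities share: {top3_university_share:.2%}")
```

Computing the shares this way makes it easy to repeat the check if more data is collected later.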

Consider implications of data collection, storage, and data biases you would consider relevant here considering Data Ethics, Business Outcomes, and Technical Implications

Discuss Ethical implications of these factors:

    - Data Collection:
        + Data could be collected in a way that leaves out a certain group of people, so that group is missing from the results.
    - Storage:
        + If the data is not stored properly, user information could be leaked.
    - Biases:
        + A bias could cause particular groups of people to be represented improperly, which could lead to unfair outcomes for them.
        
Discuss Business outcome implications of these factors:

    - Data Collection:
        + A person can lie about their information, which leads the model to predict from information that is not true.
    - Storage:
        + If the data isn't stored securely, someone could insert records that steer the model toward a different prediction, or change data that is already there, which could remove possible patterns in the data.
    - Biases:
        + A data bias could make the model less reliable because the data may not represent the desired population.

Discuss Technical implications of these factors

    - Data Collection:
        + Make sure the data is collected in a way that reaches the desired group of people, to get a complete overview of the population.
    - Storage:
        + Steps should be taken so that only people who should see the data get to see it.
    - Biases:
        + Systems could be put in place to check whether biases are showing up in the data or in the predictions.
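
One simple form such a bias check could take is a threshold on each group's share of the data. The sketch below is only an illustration: the helper `flag_underrepresented` and the 1% threshold are hypothetical choices, not something prescribed by the assignment, and the counts are copied from the notebook output.

```python
def flag_underrepresented(counts, min_share=0.01):
    """Return the groups whose share of the total falls below min_share.

    counts: dict mapping group name -> number of rows.
    min_share: hypothetical cutoff; 0.01 means "less than 1% of the data".
    """
    total = sum(counts.values())
    return [group for group, n in counts.items() if n / total < min_share]


# Year counts copied from the notebook output.
year_counts = {"Year 3": 2719, "Year 2": 2273, "Year 1": 5, "Year 4": 3}

flagged_years = flag_underrepresented(year_counts)
print(f"Under-represented years: {flagged_years}")
```

The same helper could be run on the university or major counts whenever new data arrives, so that under-represented groups are flagged before a model is retrained.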

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix
import joblib


In [2]:
fileName = "XTern 2024 Artificial Intelegence Data Set - Xtern_TrainData.csv"
data = pd.read_csv(fileName)

In [3]:
yearDist = data['Year'].value_counts().to_dict()
majorDist = data['Major'].value_counts().to_dict()
universityDist = data['University'].value_counts().to_dict()
timeDist = data['Time'].value_counts().to_dict()


print("distribution of peoples year:")
for element, count in yearDist.items():
    print(f"{element}: {count}")

print("\ndistribution of peoples Major:")
for element, count in majorDist.items():
    print(f"{element}: {count}")

print("\ndistribution of peoples Uni:")
for element, count in universityDist.items():
    print(f"{element}: {count}")

print("\ndistribution of peoples Time:")
for element, count in timeDist.items():
    print(f"{element}: {count}")


distribution of peoples year:
Year 3: 2719
Year 2: 2273
Year 1: 5
Year 4: 3

distribution of peoples Major:
Chemistry: 640
Biology: 635
Astronomy: 619
Physics: 610
Mathematics: 582
Economics: 511
Business Administration: 334
Political Science: 309
Marketing: 239
Anthropology: 146
Finance: 135
Psychology: 76
Accounting: 62
Sociology: 31
International Business: 29
Music: 21
Mechanical Engineering: 11
Philosophy: 4
Fine Arts: 3
Civil Engineering: 3

distribution of peoples Uni:
Butler University: 1614
Indiana State University: 1309
Ball State University: 1085
Indiana University-Purdue University Indianapolis (IUPUI): 682
University of Notre Dame: 144
University of Evansville: 143
Indiana University Bloomington: 12
Valparaiso University: 9
Purdue University: 1
DePauw University: 1

distribution of peoples Time:
13: 1316
12: 1314
14: 883
11: 857
15: 282
10: 247
16: 49
9: 40
8: 8
17: 4


In [4]:
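# separates the features (x) from the target column (y)
x = data.drop('Order', axis=1)
y = data['Order']

# encodes every text column as integer labels; the encoder is refit for each column,
# so the fitted mapping of earlier columns is not kept
enc = LabelEncoder()
xEncoded = x.apply(lambda col: enc.fit_transform(col) if col.dtype == 'O' else col)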

# splits data into train and test split (90 10 split)
xTrain, xTest, yTrain, yTest = train_test_split(xEncoded, y, test_size=0.1, random_state=42)

In [5]:
# sets the options to go through
param_grid = {
    'C': [5, 10, 15, 20],
    'kernel': ['rbf'],
    'gamma': ['auto']
}
# goes through each of the options looking for the combination that gives the best accuracy
svmClassifier = SVC(random_state=42)
gridSearch = GridSearchCV(svmClassifier, param_grid, cv=5, scoring='accuracy')
gridSearch.fit(xTrain, yTrain)


bestOptions = gridSearch.best_params_ # displays the best parameters
print(f'The best options are:  {bestOptions}')

# After the best parameters are found, the final model is created with the best parameters found
svmBest = SVC(random_state=42, **bestOptions, probability=True)
svmBest.fit(xTrain, yTrain)# model is trained on the training data

# After the model is created and trained, it is tested on the test data for the accuracy
testPredictions = svmBest.predict(xTest) 

testAccuracy = accuracy_score(yTest, testPredictions) # gets and displays accuracy
print(f'The test accuracy is: {testAccuracy}')

The best options are:  {'C': 15, 'gamma': 'auto', 'kernel': 'rbf'}
The test accuracy is: 0.67


In [6]:
joblib.dump(svmBest, "svm_model.pkl")

['svm_model.pkl']

In [7]:
# displays the confusion matrix of the results
confusion_mat = confusion_matrix(yTest, testPredictions)
print(f'Confusion Matrix:\n{confusion_mat}')

Confusion Matrix:
[[29  3  3  0  1  7  0  4  0  2]
 [ 4 37  1  0  4  6  2  4  0  1]
 [ 0  8 34  1  0  1  0  0  4  2]
 [ 0  0  0 32  4  0  1  5  9  0]
 [ 0 10  1  5 25  2  0  2  0  5]
 [ 0  0  0  0  0 36  3  0  0  1]
 [ 0  2  1  0  1  1 46  0  0  0]
 [ 2  8  0  8  1  4  1 32  0  0]
 [ 0  2  3  3  0  0  2  5 31  2]
 [ 2  4  2  0  2  3  0  0  0 33]]
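
The confusion matrix can also be read per class. The sketch below copies the printed matrix and computes the overall accuracy and the per-class recall. It relies on scikit-learn's convention that rows are true labels and columns are predicted labels. The variable names are illustrative, and the class indices are not mapped to dish names because the label order is not shown in the output.

```python
# Confusion matrix copied from the notebook output
# (rows = true class, columns = predicted class).
confusion_rows = [
    [29, 3, 3, 0, 1, 7, 0, 4, 0, 2],
    [4, 37, 1, 0, 4, 6, 2, 4, 0, 1],
    [0, 8, 34, 1, 0, 1, 0, 0, 4, 2],
    [0, 0, 0, 32, 4, 0, 1, 5, 9, 0],
    [0, 10, 1, 5, 25, 2, 0, 2, 0, 5],
    [0, 0, 0, 0, 0, 36, 3, 0, 0, 1],
    [0, 2, 1, 0, 1, 1, 46, 0, 0, 0],
    [2, 8, 0, 8, 1, 4, 1, 32, 0, 0],
    [0, 2, 3, 3, 0, 0, 2, 5, 31, 2],
    [2, 4, 2, 0, 2, 3, 0, 0, 0, 33],
]

# Correct predictions sit on the diagonal.
correct = sum(row[i] for i, row in enumerate(confusion_rows))
total = sum(sum(row) for row in confusion_rows)
overall_accuracy = correct / total

# Recall for class i = correct predictions for i / all true examples of i.
per_class_recall = [row[i] / sum(row) for i, row in enumerate(confusion_rows)]
weakest_class = per_class_recall.index(min(per_class_recall))

print(f"Overall accuracy: {overall_accuracy:.2f}")
for i, recall in enumerate(per_class_recall):
    print(f"class {i}: recall {recall:.2f}")
```

The per-class view shows that the 0.67 accuracy is not spread evenly across classes, which is not visible from the single accuracy number.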


In [8]:
# counts how often each menu item was ordered
orderDist = data['Order'].value_counts().to_dict()

print("\ndistribution of peoples Orders:")
for element, count in orderDist.items():
    print(f"{element}: {count}")


distribution of peoples Orders:
Sugar Cream Pie: 512
Indiana Pork Chili: 510
Cornbread Hush Puppies: 510
Sweet Potato Fries: 508
Ultimate Grilled Cheese Sandwich (with bacon and tomato): 503
Indiana Buffalo Chicken Tacos (3 tacos): 496
Indiana Corn on the Cob (brushed with garlic butter): 495
Breaded Pork Tenderloin Sandwich: 494
Fried Catfish Basket: 490
Hoosier BBQ Pulled Pork Sandwich: 482


What considerations would you make to determine if this is a suitable course of action?

Given the test accuracy of 67 percent, about 33 percent of predictions will be wrong, which means that over time roughly 1 of every 3 people will get a 10 percent discount on their order. For the promotion to be worth it, it needs to attract enough new users to offset the revenue lost to the discounts given out.
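
The break-even reasoning can be put into rough numbers. The sketch below assumes, as the answer implies, that a discount is given whenever the prediction misses. It also makes two simplifying assumptions that are not in the original: every order has the same value, and the test accuracy holds in production. The variable names are illustrative.

```python
test_accuracy = 0.67   # from the notebook output
discount_rate = 0.10   # 10 percent discount from the promotion

# Share of orders that receive the discount (assumed: every missed prediction).
discounted_share = 1 - test_accuracy

# Average fraction of revenue given away per order.
expected_loss_fraction = discounted_share * discount_rate

# Growth in order volume needed so total revenue matches the no-promotion case:
# (1 + growth) * (1 - loss) = 1  =>  growth = 1 / (1 - loss) - 1
break_even_growth = 1 / (1 - expected_loss_fraction) - 1

print(f"Expected revenue lost per order: {expected_loss_fraction:.1%}")
print(f"Order growth needed to break even: {break_even_growth:.2%}")
```

Under these assumptions the promotion gives away about 3.3 percent of revenue, so it would need to grow order volume by a little more than that to pay for itself. A fuller analysis would use profit margins and not just revenue.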