## TITLE: Analysis of User Knowledge Modeling - Predicting an Individual's Knowledge Level

### INTRODUCTION:

Understanding knowledge is key to assessing an individual's ability to learn and apply new information. This is generally a challenging problem because people are skilled in varied fields and therefore don't necessarily have a clear idea of how they compare with others. This data set classifies each user's knowledge level into one of four classes, ranging from "very_low" to "High", and contains several potential predictors that may help us predict that level. In this analysis, we ask the following question: "Which two attributes in this data set are the best predictors of a user's knowledge level?" We use the following data set: https://archive.ics.uci.edu/static/public/257/user+knowledge+modeling.zip
This data set has six attributes, five candidate predictors and one target: STG (The degree of study time for goal object materials), SCG (The degree of repetition number of user for goal object materials), STR (The degree of study time of user for related objects with goal object), LPR (The exam performance of user for related objects with goal object), PEG (The exam performance of user for goal objects), and UNS (The knowledge level of user), which is the target we want to predict.

### Methods and Results:

#### Set up: 
We started by importing the libraries needed to read the Excel file directly from the remote ZIP archive. We then dropped the columns that would not affect our analysis, namely the unnamed columns and the attribute information column. We confirmed that the data is tidy: each attribute is in its own column, each observation is in its own row, and each value is in its own cell. The final step of our setup was to check for missing or NaN values in the dataframe; there were none. These steps ensured that the dataframe was ready to work with.

In [1]:
#a few extra things need to be imported in order to remotely read the excel file within the zip file
import requests
from io import BytesIO
import pandas as pd
from zipfile import ZipFile

#finding and reading the excel file
zip_file_url = 'https://archive.ics.uci.edu/static/public/257/user+knowledge+modeling.zip'
response = requests.get(zip_file_url)
zip_file = ZipFile(BytesIO(response.content))
excel_file = zip_file.namelist()[0]
initial_read = pd.read_excel(zip_file.open(excel_file), sheet_name='Training_Data')

#dropping some extra columns not relevant to our analysis. knowledge_training should output just the data and the classification
knowledge_training_initial = initial_read.drop(columns=['Unnamed: 6', 'Unnamed: 7', 'Attribute Information:'])
#optional step, currently disabled: keep only STG and PEG (study time and exam performance) as predictor variables
# knowledge_training = knowledge_training_initial.drop(columns=['SCG', 'STR', 'LPR'])

#renaming columns to be legible
knowledge_training = knowledge_training_initial.rename(columns={'STG': 'study_time',
                                                                'SCG' : 'repetition_by_user',
                                                                'STR' : 'study_time_for_related_subjects',
                                                                'LPR' : 'exam_performance_for_related_subjects',
                                                                'PEG': 'exam_performance',
                                                                ' UNS': 'knowledge'})
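
The rename map above uses the key `' UNS'` with a leading space, which has to match the header exactly as it appears in the spreadsheet. A minimal sketch of one way to avoid depending on that stray space is to strip whitespace from every header first. The toy dataframe and the names `raw`, `cleaned`, and `renamed` are illustrative assumptions, not part of the analysis:

```python
import pandas as pd

# Toy frame standing in for the spreadsheet; note the leading space in ' UNS'.
raw = pd.DataFrame({
    "STG": [0.00, 0.08],
    "PEG": [0.00, 0.90],
    " UNS": ["very_low", "High"],
})

# Strip surrounding whitespace from every column header before renaming.
cleaned = raw.rename(columns=lambda name: name.strip())

# Shortened, illustrative rename map; the keys now match without the stray space.
renamed = cleaned.rename(columns={
    "STG": "study_time",
    "PEG": "exam_performance",
    "UNS": "knowledge",
})
print(list(renamed.columns))
```

With this approach a later change in header spacing would not silently leave a column unrenamed.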

In [2]:
check_nan = knowledge_training.isnull().values.any()
print(check_nan)

False


In [3]:
any_missing = knowledge_training.isna().any().any()
print(any_missing)

False
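
In pandas, `isnull` and `isna` are aliases, so the two checks above answer the same question. If a check ever returned `True`, a per-column count would show where the gaps are. The following is a small sketch on a toy dataframe (`toy` and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy data with one deliberately missing value.
toy = pd.DataFrame({
    "study_time": [0.1, np.nan, 0.3],
    "exam_performance": [0.5, 0.6, 0.7],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Single yes/no answer, equivalent to toy.isna().any().any().
has_missing = bool(missing_per_column.sum() > 0)
print(has_missing)
```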


In [4]:
knowledge_training

Unnamed: 0,study_time,repetition_by_user,study_time_for_related_subjects,exam_performance_for_related_subjects,exam_performance,knowledge
0,0.00,0.00,0.00,0.00,0.00,very_low
1,0.08,0.08,0.10,0.24,0.90,High
2,0.06,0.06,0.05,0.25,0.33,Low
3,0.10,0.10,0.15,0.65,0.30,Middle
4,0.08,0.08,0.08,0.98,0.24,Low
...,...,...,...,...,...,...
253,0.61,0.78,0.69,0.92,0.58,High
254,0.78,0.61,0.71,0.19,0.60,Middle
255,0.54,0.82,0.71,0.29,0.77,High
256,0.50,0.75,0.81,0.61,0.26,Middle


Next, we create a correlation matrix to quantify the linear relationships between the attributes in our dataframe. Because our prediction target is categorical, we one-hot encoded the knowledge levels so that each level can be included in the correlation matrix. The encoded dataframe and the correlation matrix are shown below. Some attributes are less correlated with knowledge level than others, so we chose to eliminate any attribute whose correlation with every knowledge level is 0.15 or lower in absolute value. In the correlation matrix, the "study_time_for_related_subjects" attribute scored consistently between -0.15 and 0.15; it was therefore eliminated from our analysis because it did not appear to be linearly correlated with knowledge level.

In [5]:
knowledge_training_encoded = pd.get_dummies(knowledge_training, columns=['knowledge'], prefix='KnowledgeLevel')
knowledge_training_encoded

Unnamed: 0,study_time,repetition_by_user,study_time_for_related_subjects,exam_performance_for_related_subjects,exam_performance,KnowledgeLevel_High,KnowledgeLevel_Low,KnowledgeLevel_Middle,KnowledgeLevel_very_low
0,0.00,0.00,0.00,0.00,0.00,0,0,0,1
1,0.08,0.08,0.10,0.24,0.90,1,0,0,0
2,0.06,0.06,0.05,0.25,0.33,0,1,0,0
3,0.10,0.10,0.15,0.65,0.30,0,0,1,0
4,0.08,0.08,0.08,0.98,0.24,0,1,0,0
...,...,...,...,...,...,...,...,...,...
253,0.61,0.78,0.69,0.92,0.58,1,0,0,0
254,0.78,0.61,0.71,0.19,0.60,0,0,1,0
255,0.54,0.82,0.71,0.29,0.77,1,0,0,0
256,0.50,0.75,0.81,0.61,0.26,0,0,1,0


In [6]:
correlation_matrix = knowledge_training_encoded.corr()
print(correlation_matrix)

                                       study_time  repetition_by_user  \
study_time                               1.000000            0.081035   
repetition_by_user                       0.081035            1.000000   
study_time_for_related_subjects          0.040841            0.083732   
exam_performance_for_related_subjects    0.099543            0.097816   
exam_performance                         0.206359            0.182792   
KnowledgeLevel_High                      0.136785            0.181403   
KnowledgeLevel_Low                      -0.164088           -0.060793   
KnowledgeLevel_Middle                    0.098838            0.041648   
KnowledgeLevel_very_low                 -0.099734           -0.238506   

                                       study_time_for_related_subjects  \
study_time                                                    0.040841   
repetition_by_user                                            0.083732   
study_time_for_related_subjects                

In [7]:
knowledge_training_final = knowledge_training.drop(columns=['study_time_for_related_subjects'])
knowledge_training_final

Unnamed: 0,study_time,repetition_by_user,exam_performance_for_related_subjects,exam_performance,knowledge
0,0.00,0.00,0.00,0.00,very_low
1,0.08,0.08,0.24,0.90,High
2,0.06,0.06,0.25,0.33,Low
3,0.10,0.10,0.65,0.30,Middle
4,0.08,0.08,0.98,0.24,Low
...,...,...,...,...,...
253,0.61,0.78,0.92,0.58,High
254,0.78,0.61,0.19,0.60,Middle
255,0.54,0.82,0.29,0.77,High
256,0.50,0.75,0.61,0.26,Middle
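
The elimination rule described above was applied by reading the correlation matrix. As a rough sketch, the same rule could be applied programmatically. The example below uses synthetic data, and the column names `informative` and `uninformative` are hypothetical; only the 0.15 threshold and the `KnowledgeLevel` prefix come from the analysis:

```python
import numpy as np
import pandas as pd

# Synthetic data: one feature that determines the class, one that is pure noise.
rng = np.random.default_rng(0)
n = 2000
signal = rng.uniform(0, 1, n)
toy = pd.DataFrame({
    "informative": signal,
    "uninformative": rng.uniform(0, 1, n),
    "knowledge": pd.cut(signal, bins=[0, 0.25, 0.5, 0.75, 1],
                        labels=["very_low", "Low", "Middle", "High"],
                        include_lowest=True),
})

# One-hot encode the target, as in the analysis (dtype=float keeps corr() simple).
encoded = pd.get_dummies(toy, columns=["knowledge"], prefix="KnowledgeLevel", dtype=float)

# Keep only the feature-versus-knowledge-level block of the correlation matrix.
level_columns = [c for c in encoded.columns if c.startswith("KnowledgeLevel_")]
feature_columns = ["informative", "uninformative"]
feature_vs_level = encoded.corr().loc[feature_columns, level_columns]

# Flag features whose |correlation| is at or below the threshold for every level.
threshold = 0.15
weak = (feature_vs_level.abs() <= threshold).all(axis=1)
to_drop = weak[weak].index.tolist()
print(to_drop)
```

On the real data, `feature_columns` would be the five renamed predictors and `to_drop` would be passed to `drop(columns=...)`.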


Next, we show some summary statistics for our dataframe, namely the distribution of the knowledge levels it contains.

In [8]:
100 * knowledge_training_final.groupby('knowledge').size() / knowledge_training_final.shape[0]

knowledge
High        24.418605
Low         32.170543
Middle      34.108527
very_low     9.302326
dtype: float64

In [9]:
knowledge_training_final['knowledge'].value_counts()

Middle      88
Low         83
High        63
very_low    24
Name: knowledge, dtype: int64
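
The percentages and counts above are two views of the same distribution. As a small sketch, `value_counts(normalize=True)` gives the percentages directly. The labels here are rebuilt from the counts shown above rather than loaded from the file, and `counts`, `labels`, and `percentages` are illustrative names:

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({"Middle": 88, "Low": 83, "High": 63, "very_low": 24})

# Rebuild one label per observation so the example is self-contained.
labels = pd.Series(counts.index.repeat(counts.to_numpy()), name="knowledge")

# Share of each knowledge level, in percent.
percentages = 100 * labels.value_counts(normalize=True)
print(percentages.round(2))
```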

The next step is to visualize each pair of attributes in our dataframe to help identify any correlations between them.

### Study Time vs. Repetition by User:

In [10]:
import altair as alt

studyTime_repetitionByUser_plot = alt.Chart(knowledge_training_final, title = "Figure 1: Study Time vs. Repetition by User").mark_circle().encode(
    x=alt.X("study_time:Q", title="Study Time"),  
    y=alt.Y("repetition_by_user:Q", title="Repetition by User"), 
    color=alt.Color("knowledge:N", title="Knowledge Level") 
)
studyTime_repetitionByUser_plot

### Study Time vs. Exam Performance for Related Subjects:

In [11]:
studyTime_examPerfForRelatedSubj_plot = alt.Chart(knowledge_training_final, title = "Figure 2: Study Time vs. Exam Performance for Related Subjects").mark_circle().encode(
    x=alt.X("study_time:Q", title="Study Time"),  
    y=alt.Y("exam_performance_for_related_subjects:Q", title="Exam Performance for Related Subjects"), 
    color=alt.Color("knowledge:N", title="Knowledge Level") 
)
studyTime_examPerfForRelatedSubj_plot

### Study Time vs. Exam Performance:

In [12]:
studyTime_examPerf_plot = alt.Chart(knowledge_training_final, title = "Figure 3: Study Time vs. Exam Performance").mark_circle().encode(
    x=alt.X("study_time:Q", title="Study Time"),  
    y=alt.Y("exam_performance:Q", title="Exam Performance"), 
    color=alt.Color("knowledge:N", title="Knowledge Level") 
)
studyTime_examPerf_plot

### Repetition by User vs. Exam Performance for Related Subjects:

In [13]:
repByUser_examPerfRelatedTopics_plot = alt.Chart(knowledge_training_final, title = "Figure 4: Repetition by User vs. Exam Performance for Related Subjects").mark_circle().encode(
    x=alt.X("repetition_by_user:Q", title="Repetition by User"),  
    y=alt.Y("exam_performance_for_related_subjects:Q", title="Exam Performance for Related Subjects"), 
    color=alt.Color("knowledge:N", title="Knowledge Level") 
)
repByUser_examPerfRelatedTopics_plot

### Repetition by User vs. Exam Performance:

In [14]:
repByUser_examPerf_plot = alt.Chart(knowledge_training_final, title = "Figure 5: Repetition by User vs. Exam Performance").mark_circle().encode(
    x=alt.X("repetition_by_user:Q", title="Repetition by User"),  
    y=alt.Y("exam_performance:Q", title="Exam Performance"), 
    color=alt.Color("knowledge:N", title="Knowledge Level") 
)
repByUser_examPerf_plot

### Exam Performance for Related Subjects vs. Exam Performance:

In [15]:
examPerf_examPerfRelatedTopics_plot = alt.Chart(knowledge_training_final, title = "Figure 6: Exam Performance for Related Subjects vs. Exam Performance").mark_circle().encode(
    x=alt.X("exam_performance_for_related_subjects:Q", title="Exam Performance for Related Subjects"),  
    y=alt.Y("exam_performance:Q", title="Exam Performance"), 
    color=alt.Color("knowledge:N", title="Knowledge Level") 
)
examPerf_examPerfRelatedTopics_plot
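
The six cells above differ only in the two columns plotted and the title. One possible way to reduce the repetition, sketched here under the assumption that the figure order follows `itertools.combinations` over the four remaining columns, is to generate each title from its axis names. The helpers `pair_specs` and `scatter` are hypothetical and not part of the notebook:

```python
from itertools import combinations

# Column name -> axis title, in the order used by the figures above.
AXIS_TITLES = {
    "study_time": "Study Time",
    "repetition_by_user": "Repetition by User",
    "exam_performance_for_related_subjects": "Exam Performance for Related Subjects",
    "exam_performance": "Exam Performance",
}

def pair_specs(axis_titles):
    """Return one (x, y, title) tuple per unordered pair of columns."""
    specs = []
    for number, (x, y) in enumerate(combinations(axis_titles, 2), start=1):
        title = f"Figure {number}: {axis_titles[x]} vs. {axis_titles[y]}"
        specs.append((x, y, title))
    return specs

def scatter(df, x, y, title, axis_titles=AXIS_TITLES):
    """Build one scatter plot colored by knowledge level."""
    import altair as alt  # imported here so pair_specs runs without Altair installed
    return alt.Chart(df, title=title).mark_circle().encode(
        x=alt.X(f"{x}:Q", title=axis_titles[x]),
        y=alt.Y(f"{y}:Q", title=axis_titles[y]),
        color=alt.Color("knowledge:N", title="Knowledge Level"),
    )

specs = pair_specs(AXIS_TITLES)
for x, y, title in specs:
    print(title)
```

In the notebook, the charts could then be built with `[scatter(knowledge_training_final, x, y, title) for x, y, title in specs]`. Because each title is derived from its axes, the two cannot drift apart.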

### Takeaways from Attribute Visualizations:
- It appears that the only correlations worth noting are the ones in Figure 3 and Figure 6 above.
- Figure 3 appears to display a weak positive correlation between Study Time and Exam Performance.
- Figure 6 appears to display a weak negative correlation between Exam Performance for Related Subjects and Exam Performance.
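
Visual impressions such as "weak positive" can be cross-checked with a Pearson coefficient; the correlation matrix above, for instance, reports about 0.21 between study_time and exam_performance. The sketch below shows the computation on synthetic data constructed to have a weak positive relationship. The data and the 0.2 slope are made up for illustration and are not taken from the data set:

```python
import numpy as np
import pandas as pd

# Synthetic data: exam performance depends only weakly on study time.
rng = np.random.default_rng(1)
n = 2000
study_time = rng.uniform(0, 1, n)
exam_performance = 0.2 * study_time + rng.uniform(0, 1, n)
toy = pd.DataFrame({"study_time": study_time, "exam_performance": exam_performance})

# Pearson correlation between the two columns (the default method of Series.corr).
r = toy["study_time"].corr(toy["exam_performance"])
print(round(r, 3))
```

On the real data, the same call on `knowledge_training_final` would give the coefficient behind each figure.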

We will focus on the four remaining attributes (study time, repetition by user, exam performance for related subjects, and exam performance) to create training and testing data for predicting a subject's knowledge level. The results will be outlined in the discussion section.

In [24]:
#Creating train/test split
from sklearn.model_selection import train_test_split

knowledge_train, knowledge_test = train_test_split(knowledge_training_final, test_size=0.25, random_state=123)
knowledge_train

Unnamed: 0,study_time,repetition_by_user,exam_performance_for_related_subjects,exam_performance,knowledge
107,0.305,0.255,0.40,0.54,Middle
89,0.290,0.300,0.09,0.67,Middle
85,0.248,0.300,0.20,0.03,very_low
10,0.180,0.180,0.30,0.81,High
26,0.040,0.280,0.25,0.10,very_low
...,...,...,...,...,...
106,0.260,0.280,0.29,0.59,Middle
83,0.250,0.290,0.48,0.26,Low
17,0.100,0.250,0.08,0.33,Low
230,0.730,0.430,0.12,0.65,Middle


In [23]:
knowledge_test

Unnamed: 0,study_time,repetition_by_user,exam_performance_for_related_subjects,exam_performance,knowledge
30,0.120,0.245,0.31,0.59,Middle
100,0.270,0.280,0.48,0.26,Low
90,0.258,0.280,0.29,0.56,Middle
197,0.730,0.200,0.72,0.26,Low
198,0.780,0.150,0.18,0.63,Middle
...,...,...,...,...,...
205,0.620,0.140,0.81,0.15,Low
239,0.520,0.440,0.30,0.52,Middle
95,0.255,0.305,0.62,0.15,Low
23,0.180,0.310,0.42,0.28,Low
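
Since the knowledge levels are unevenly represented (very_low makes up about 9% of the rows), one optional variation is a stratified split, which keeps the class proportions similar in both subsets. This is a sketch of that variation, not what the split above does. The toy dataframe reuses the class counts reported earlier, and its feature column is a placeholder:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Class counts copied from the value_counts() output earlier in the notebook.
counts = {"Middle": 88, "Low": 83, "High": 63, "very_low": 24}
toy = pd.DataFrame({
    "knowledge": [level for level, k in counts.items() for _ in range(k)]
})
toy["exam_performance"] = range(len(toy))  # placeholder feature, not real data

# stratify= asks scikit-learn to preserve the class proportions in both subsets.
train, test = train_test_split(
    toy, test_size=0.25, random_state=123, stratify=toy["knowledge"]
)
print(test["knowledge"].value_counts())
```

In the notebook, this would amount to adding `stratify=knowledge_training_final['knowledge']` to the existing `train_test_split` call.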
