## TITLE: Analysis of User Knowledge Modeling - Predicting an Individual's Knowledge Level

### INTRODUCTION: 

Understanding knowledge is key to assessing an individual's ability to learn and apply new information. Assessing it is challenging because people are skilled in varied fields and so don't necessarily have a clear idea of how they compare with others. This data set labels each user with a knowledge level ranging from "very_low" to "high" and contains several potential predictors of that level. In this analysis, we'll ask the following question: "Which two attributes in this data set best predict a user's knowledge level?" We'll use the following data set: https://archive.ics.uci.edu/static/public/257/user+knowledge+modeling.zip
This data set has five predictor attributes and one target attribute: STG (the degree of study time for goal object materials), SCG (the degree of repetition number of user for goal object materials), STR (the degree of study time of user for related objects with goal object), LPR (the exam performance of user for related objects with goal object), PEG (the exam performance of user for goal objects), and the target UNS (the knowledge level of user).

### Methods and Results:

#### Set up: 
We started by importing the packages needed to read the Excel file directly from the remote ZIP archive. We then dropped the columns that do not affect our analysis, namely the unnamed columns and the attribute information column. Our data is tidy: each attribute is in its own column, each observation is in its own row, and each value is in its own cell. The final step of our setup was to check for missing or NaN values in the dataframe; there were none. These steps ensured that the dataframe was ready to work with.

In [None]:
#a few extra things need to be imported in order to remotely read the excel file within the zip file
import requests
from io import BytesIO
import pandas as pd
from zipfile import ZipFile

#finding and reading the excel file
zip_file_url = 'https://archive.ics.uci.edu/static/public/257/user+knowledge+modeling.zip'
response = requests.get(zip_file_url)
zip_file = ZipFile(BytesIO(response.content))
excel_file = zip_file.namelist()[0]
initial_read = pd.read_excel(zip_file.open(excel_file), sheet_name='Training_Data')

#dropping some extra columns not relevant to our analysis. knowledge_training should output just the data and the classification
knowledge_training_initial = initial_read.drop(columns=['Unnamed: 6', 'Unnamed: 7', 'Attribute Information:'])
#further dropping columns that contain data, but are not to be used as predictor variables (everything but STG and PEG, study time and exam performance)
# knowledge_training = knowledge_training_initial.drop(columns=['SCG', 'STR', 'LPR'])

#renaming columns to be legible -- the rename key for the target column is ' UNS' with a leading space; a plain 'UNS' key left the column unchanged
knowledge_training = knowledge_training_initial.rename(columns={'STG': 'study_time',
                                                                'SCG' : 'repetition_by_user',
                                                                'STR' : 'study_time_for_related_subjects',
                                                                'LPR' : 'exam_performance_for_related_subjects',
                                                                'PEG': 'exam_performance',
                                                                ' UNS': 'knowledge'})
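
The cell above takes `zip_file.namelist()[0]`, which assumes the spreadsheet is the first member of the archive. A minimal sketch of a more defensive lookup, assuming the spreadsheet can be recognized by its `.xls` or `.xlsx` extension. `find_excel_member` is a hypothetical helper, and the archive built here is a made-up stand-in so the example runs offline:

```python
from io import BytesIO
from zipfile import ZipFile


def find_excel_member(zip_file):
    """Return the first member whose name ends in .xls or .xlsx (hypothetical helper)."""
    for name in zip_file.namelist():
        if name.lower().endswith((".xls", ".xlsx")):
            return name
    raise FileNotFoundError("no Excel file found in the archive")


# Build a small in-memory archive to exercise the helper.
# The member names and contents are placeholders, not the real UCI files.
buffer = BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("readme.txt", "placeholder")
    archive.writestr("data/example_workbook.xls", "placeholder")

with ZipFile(buffer) as archive:
    excel_name = find_excel_member(archive)

print(excel_name)
```

With the real download, the same helper could be called on `zip_file` in place of indexing `namelist()` directly.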

In [None]:
check_nan = knowledge_training.isnull().values.any()
print(check_nan)

In [None]:
any_missing = knowledge_training.isna().any().any()
print(any_missing)
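
In pandas, `isnull` is an alias of `isna`, so the two checks above report the same thing. When a check does come back `True`, a per-column count is usually more informative than a single boolean. A small sketch on a toy dataframe (the values are invented for illustration and are not taken from the data set):

```python
import pandas as pd

# Toy frame with one deliberately missing value; not real data.
toy = pd.DataFrame({
    "study_time": [0.1, None, 0.3],
    "exam_performance": [0.5, 0.6, 0.7],
})

# isnull and isna produce identical masks.
same_mask = toy.isnull().equals(toy.isna())

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

print(same_mask)
print(missing_per_column)
```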

In [None]:
knowledge_training

Next, we create a correlation matrix to quantify the linear relationships between the attributes in our dataframe. Because our target attribute is categorical, we one-hot encoded the knowledge levels so that each level gets its own indicator column and can appear in the correlation matrix. The encoded dataframe and the correlation matrix are shown below. Some attributes are less correlated with knowledge level than others, so we chose to eliminate any attribute whose correlation with every knowledge-level column is 0.15 or lower in absolute value. In the correlation matrix, the "study_time_for_related_subjects" attribute scored consistently between -0.15 and 0.15; we therefore eliminated it from our analysis, as it did not appear to be correlated with knowledge level.

In [None]:
knowledge_training_encoded = pd.get_dummies(knowledge_training, columns=['knowledge'], prefix='KnowledgeLevel')
knowledge_training_encoded

In [None]:
correlation_matrix = knowledge_training_encoded.corr()
print(correlation_matrix)
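
The 0.15 rule can also be applied programmatically instead of by reading the printed matrix. The following is a sketch under stated assumptions: `weak_predictors` is a hypothetical helper, "consistently" is read as "for every knowledge-level indicator column", and the toy values are invented so the example is self-contained.

```python
import pandas as pd


def weak_predictors(df, target, threshold=0.15):
    """Return (correlations, names of predictors at or below the threshold for every level)."""
    # One-hot encode the target as float columns so corr() treats them as numeric.
    encoded = pd.get_dummies(df, columns=[target], prefix="KnowledgeLevel", dtype=float)
    level_cols = [c for c in encoded.columns if c.startswith("KnowledgeLevel_")]
    predictor_cols = [c for c in encoded.columns if c not in level_cols]
    # Keep only the predictor-versus-level block of the correlation matrix.
    corr = encoded.corr().loc[predictor_cols, level_cols]
    # A predictor is "weak" if |r| <= threshold against every level column.
    weak = corr.abs().le(threshold).all(axis=1)
    return corr, weak[weak].index.tolist()


# Toy data, not taken from the data set.
toy = pd.DataFrame({
    "exam_performance": [0.1, 0.2, 0.8, 0.9],
    "study_time_for_related_subjects": [0.2, 0.8, 0.2, 0.8],
    "knowledge": ["low", "low", "high", "high"],
})

corr, weak = weak_predictors(toy, "knowledge")
print(corr)
print(weak)
```

On the real data, the call would be `weak_predictors(knowledge_training, "knowledge")`.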

In [None]:
knowledge_training_final = knowledge_training.drop(columns=['study_time_for_related_subjects'])
knowledge_training_final

Next, we summarize the distribution of knowledge levels in our dataframe, first as percentages and then as counts.

In [None]:
100 * knowledge_training_final.groupby('knowledge').size() / knowledge_training_final.shape[0]

In [None]:
knowledge_training_final['knowledge'].value_counts()
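
The counts and percentages can also be placed side by side in one table, which makes class imbalance easier to spot. A minimal sketch on invented labels (the label values are placeholders, not the data set's actual distribution):

```python
import pandas as pd

# Toy labels, not real data.
labels = pd.Series(["High", "Low", "Low", "Middle"], name="knowledge")

counts = labels.value_counts()
summary = pd.DataFrame({
    "count": counts,
    "percent": 100 * counts / len(labels),
})
print(summary)
```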

The next step is to plot each pair of attributes in our dataframe, colored by knowledge level, to help identify any relationship between the attributes and knowledge level.
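
With four predictors left after dropping "study_time_for_related_subjects", there are six distinct pairs. A short sketch of enumerating them together with plot titles in the style used in this notebook; the dictionary of display names is an assumption assembled from the axis titles used here:

```python
from itertools import combinations

# Column name -> display name (assumed from the axis titles in this notebook).
predictors = {
    "study_time": "Study Time",
    "repetition_by_user": "Repetition by User",
    "exam_performance_for_related_subjects": "Exam Performance for Related Subjects",
    "exam_performance": "Exam Performance",
}

# Every unordered pair of predictors, with a title for its scatter plot.
pairs = [
    (x, y, f"{predictors[x]} vs. {predictors[y]}")
    for x, y in combinations(predictors, 2)
]

for x, y, title in pairs:
    print(title)
```

Each `(x, y, title)` tuple could then be fed into the same chart template to produce one scatter plot per pair.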

#### Study Time vs. Repetition by User:

In [None]:
import altair as alt

studyTime_repetitionByUser_plot = alt.Chart(knowledge_training_final, title = "Study Time vs. Repetition by User").mark_circle().encode(
    x=alt.X("study_time:Q", title="Study Time"),  
    y=alt.Y("repetition_by_user:Q", title="Repetition by User"), 
    color=alt.Color("knowledge:N", title="Knowledge Level") 
)
studyTime_repetitionByUser_plot

#### Study Time vs. Exam Performance for Related Subjects:

In [None]:
studyTime_examPerfForRelatedSubj_plot = alt.Chart(knowledge_training_final, title="Study Time vs. Exam Performance for Related Subjects").mark_circle().encode(
    x=alt.X("study_time:Q", title="Study Time"),  
    y=alt.Y("exam_performance_for_related_subjects:Q", title="Exam Performance for Related Subjects"), 
    color=alt.Color("knowledge:N", title="Knowledge Level") 
)
studyTime_examPerfForRelatedSubj_plot

#### Study Time vs. Exam Performance:

In [None]:
studyTime_examPerf_plot = alt.Chart(knowledge_training_final, title = "Study Time vs. Exam Performance").mark_circle().encode(
    x=alt.X("study_time:Q", title="Study Time"),  
    y=alt.Y("exam_performance:Q", title="Exam Performance"), 
    color=alt.Color("knowledge:N", title="Knowledge Level") 
)
studyTime_examPerf_plot