# Quantitative and Categorical Data for EDA and Price Analysis


We perform an inner join on `uuid`, then drop every row that contains a null value.

**There might be concerns here**: many null values follow from the game rules rather than from missing data, so dropping them will exclude the majority of cards.
- Power: only for creatures
- Toughness: only for creatures
- Color identity: not for instants and spells
- Saltiness: arbitrary opinion

We cannot make general claims about the data when we know the remaining sample is not general.
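To make the concern concrete, it helps to measure how much data `dropna` removes and which card types survive. The following is a minimal sketch on a made-up four-row table; the rows and values are hypothetical and only the column names mirror the dataset.

```python
import pandas as pd

# Hypothetical miniature of the merged table; values are invented for illustration
cards = pd.DataFrame({
    "name": ["Bear", "Bolt", "Island", "Dragon"],
    "type": ["Creature", "Instant", "Land", "Creature"],
    "power": [2.0, None, None, 5.0],
    "toughness": [2.0, None, None, 5.0],
    "edhrecSaltiness": [0.1, 0.8, None, None],
})

# Share of null values per column, to see which fields drive the row loss
null_share = cards.isna().mean()

# Rows that survive dropna, and the fraction of rows that were discarded
kept = cards.dropna()
dropped_share = 1 - len(kept) / len(cards)

print(null_share)
print(kept["type"].value_counts())
print(f"dropped share: {dropped_share:.2f}")
```

Running the same three checks on the real merged table, before calling `dropna`, would show how far the analysed sample is from the full card pool.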


In [27]:
# Load necessary libraries
import pandas as pd 
import numpy as np  
import matplotlib.pyplot as plt 
from scipy import stats
import seaborn as sns
from matplotlib import gridspec
import math
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler

sns.set()

In [30]:
# Card info
cards_csv = pd.read_csv('../dataset/cards.csv', sep=";")
prices_csv = pd.read_csv('../dataset/cardPrices.csv', sep=",")


# As a test, the analysis uses creatures only
all_data = pd.merge(prices_csv, cards_csv, on="uuid")
all_data.dropna(inplace=True)
data = all_data.reset_index(drop=True)

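# Copy the numeric columns so the encoded columns are added to an independent frame
mapped_data = data.select_dtypes(include=['number']).copy()
label_encoder = LabelEncoder()

# Label-encode (integer codes, not one-hot) the categorical data for general analysis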
to_encode = ['rarity', 'artist', 'finishes', 'currency','hasFoil' , 'supertypes',
             'hasNonFoil',  'gameAvailability', 'priceProvider', 'setCode', 'type']
for enc in to_encode:    
    mapped_data[enc] = label_encoder.fit_transform(data[enc])
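`LabelEncoder` assigns one integer per category in sorted label order, which is different from one-hot encoding. The sketch below contrasts the two on a hypothetical `rarity` column using pandas only; `cat.codes` is used here as a stand-in that also numbers categories in sorted order.

```python
import pandas as pd

# Hypothetical rarity values for illustration
rarity = pd.Series(["rare", "common", "mythic", "uncommon", "common"], name="rarity")

# Label encoding: one integer per category, assigned in sorted (alphabetical) order
label_codes = rarity.astype("category").cat.codes

# One-hot encoding: one 0/1 indicator column per category
one_hot = pd.get_dummies(rarity, prefix="rarity").astype(int)

print(label_codes.tolist())
print(one_hot)
```

With integer codes, a linear model treats the codes as an ordered numeric scale, so alphabetical order (common, mythic, rare, uncommon) is read as if it were a ranking. One-hot columns avoid that at the cost of many more features for high-cardinality fields such as `artist`.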

In [31]:
# Update mapped file
mapped_data.to_csv('../dataset/mapped_data.csv', index=False)
mapped_data.head()

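| | price | edhrecRank | edhrecSaltiness | manaValue | rarity | artist | finishes | currency | hasFoil | supertypes | hasNonFoil | gameAvailability | priceProvider | setCode | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.02 | 9476.0 | 0.84 | 8.0 | 3 | 77 | 3 | 1 | 1 | 0 | 1 | 0 | 0 | 83 | 53 |
| 1 | 0.03 | 9476.0 | 0.84 | 8.0 | 3 | 77 | 3 | 1 | 1 | 0 | 1 | 0 | 0 | 83 | 53 |
| 2 | 1.87 | 9476.0 | 0.84 | 8.0 | 3 | 77 | 3 | 1 | 1 | 0 | 1 | 1 | 4 | 83 | 53 |
| 3 | 29.16 | 9476.0 | 0.84 | 8.0 | 3 | 77 | 3 | 1 | 1 | 0 | 1 | 1 | 4 | 83 | 53 |
| 4 | 0.91 | 9476.0 | 0.84 | 8.0 | 3 | 77 | 3 | 0 | 1 | 0 | 1 | 1 | 2 | 83 | 53 |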


In [34]:
# Remove outliers to analyze prices
def removeOutliers(data, col):
    # Tukey's rule: keep values strictly between Q1 - 1.5*IQR and Q3 + 1.5*IQR
    Q3 = np.quantile(data[col], 0.75)
    Q1 = np.quantile(data[col], 0.25)
    IQR = Q3 - Q1

    lower_range = Q1 - 1.5 * IQR
    upper_range = Q3 + 1.5 * IQR
    # Boolean mask instead of building a Python list and filtering with isin
    mask = (data[col] > lower_range) & (data[col] < upper_range)
    return data.loc[mask]

mapped_data = removeOutliers(mapped_data, 'price')
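As a worked example of the IQR rule, the sketch below applies the same fences to only the five prices shown in the `head()` output above. This is an illustration: in the notebook the quartiles are computed on the full `price` column, so the actual fences differ.

```python
import numpy as np
import pandas as pd

# The five prices from the head() output, used here as a toy sample
prices = pd.Series([0.02, 0.03, 0.91, 1.87, 29.16])

q1, q3 = np.quantile(prices, [0.25, 0.75])
iqr = q3 - q1

# Fences of the 1.5 * IQR rule
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

kept = prices[(prices > lower) & (prices < upper)]
print(f"Q1={q1:.2f} Q3={q3:.2f} IQR={iqr:.2f} fences=({lower:.2f}, {upper:.2f})")
print(kept.tolist())
```

On this sample the upper fence is 4.63, so the 29.16 listing is the only value removed.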


In [37]:
# normalize the data
scaler = MinMaxScaler()
norm_data = pd.DataFrame(scaler.fit_transform(mapped_data), columns=mapped_data.columns)

display(norm_data.head())

| | price | edhrecRank | edhrecSaltiness | manaValue | rarity | artist | finishes | currency | hasFoil | supertypes | hasNonFoil | gameAvailability | priceProvider | setCode | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.002653 | 0.369006 | 0.321429 | 0.636364 | 0.75 | 0.308 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.882979 | 0.222689 |
| 1 | 0.005305 | 0.369006 | 0.321429 | 0.636364 | 0.75 | 0.308 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.882979 | 0.222689 |
| 2 | 0.493369 | 0.369006 | 0.321429 | 0.636364 | 0.75 | 0.308 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.882979 | 0.222689 |
| 3 | 0.238727 | 0.369006 | 0.321429 | 0.636364 | 0.75 | 0.308 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.5 | 0.882979 | 0.222689 |
| 4 | 0.156499 | 0.369006 | 0.321429 | 0.636364 | 0.75 | 0.308 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.25 | 0.882979 | 0.222689 |
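With its default `feature_range` of (0, 1), `MinMaxScaler` rescales each column independently as (x - min) / (max - min). A minimal sketch of the same arithmetic in plain pandas, on a small hypothetical frame:

```python
import pandas as pd

# Hypothetical values for illustration
df = pd.DataFrame({"price": [0.02, 0.91, 1.87], "manaValue": [1.0, 3.0, 8.0]})

# Column-wise min-max scaling: every column ends up in the range [0, 1]
scaled = (df - df.min()) / (df.max() - df.min())

print(scaled)
```

Because the minimum and maximum are taken per column, the scaled values are sensitive to extreme rows, which is one reason to remove price outliers before this step.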


In [38]:
import statsmodels.api as sm

X = norm_data.loc[:, ~norm_data.columns.isin(['price'])]
y = norm_data["price"]

model = sm.OLS(y, X)    # Describe model (no intercept: sm.OLS does not add a constant by default)

result = model.fit()       # Fit model

print(result.summary()) 

                                 OLS Regression Results                                
Dep. Variable:                  price   R-squared (uncentered):                   0.576
Model:                            OLS   Adj. R-squared (uncentered):              0.575
Method:                 Least Squares   F-statistic:                              646.3
Date:                Sun, 06 Oct 2024   Prob (F-statistic):                        0.00
Time:                        13:33:49   Log-Likelihood:                          970.99
No. Observations:                6209   AIC:                                     -1916.
Df Residuals:                    6196   BIC:                                     -1828.
Df Model:                          13                                                  
Covariance Type:            nonrobust                                                  
                       coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------

In [36]:
fstat, pvalue = sm.stats.linear_rainbow(result)
print(f' f stat: {fstat:4f} | p value: {pvalue:4f}')

 f stat: 1.541449 | p value: 0.000000


# Interpretation of results

#### R-squared and Adjusted R-squared
- **R-squared (0.576)**: The summary reports the *uncentered* R-squared because the model was fit without an intercept. It measures the share of the sum of squared prices, rather than of the variance around the mean price, that the independent variables account for. The 57.6% figure is therefore not directly comparable to the usual R-squared and is often higher than it; at best it suggests a moderate level of explanatory power.
  
- **Adjusted R-squared (0.575)**: This value penalizes the number of predictors in the model. It is almost identical to the R-squared, so the reported fit is not inflated by the number of variables.
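The summary labels these values "uncentered", which is what statsmodels reports when the design matrix has no constant column. The sketch below uses plain NumPy and synthetic data (hypothetical, not card data) to show how far the uncentered value can sit from the usual centered R-squared when the response mean is far from zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0.0, 1.0, size=n)
# Synthetic response: large mean, only weakly related to x
y = 5.0 + 0.5 * x + rng.normal(0.0, 1.0, size=n)

def fitted_values(design, target):
    """Least-squares fit; returns the fitted values for the given design matrix."""
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return design @ beta

# Model without intercept: compare residuals with the raw sum of squares of y
resid_no_const = y - fitted_values(x[:, None], y)
r2_uncentered = 1.0 - np.sum(resid_no_const ** 2) / np.sum(y ** 2)

# Model with intercept: compare residuals with the variation around the mean of y
design = np.column_stack([np.ones(n), x])
resid_const = y - fitted_values(design, y)
r2_centered = 1.0 - np.sum(resid_const ** 2) / np.sum((y - y.mean()) ** 2)

print(f"uncentered R^2 (no intercept): {r2_uncentered:.3f}")
print(f"centered R^2 (with intercept): {r2_centered:.3f}")
```

In statsmodels, one way to obtain the centered figure for the card model would be to fit `sm.OLS(y, sm.add_constant(X))`. The result would differ from the summary shown above and is not reproduced here.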

#### F-statistic and Prob (F-statistic)
- **F-statistic (646.3)**: A high F-statistic indicates that at least one predictor variable significantly contributes to explaining the variability in price.
  
- **Prob (F-statistic) (0.00)**: This p-value indicates that the overall regression model is statistically significant, meaning that the independent variables collectively have a significant effect on card prices.

The OLS regression results suggest that several factors are significantly associated with card prices, including EDHREC saltiness, mana value, rarity, and whether the card is foil or not:

- The negative coefficient on rarity may indicate market dynamics where rarer cards are less frequently sold or valued differently. Because rarity was label-encoded, its integer codes follow alphabetical label order rather than rarity rank, so the sign of this coefficient should be read with caution.
  
- The positive relationship between EDHREC saltiness and price suggests that cards perceived as more desirable or playable are valued higher by collectors and players.

---

# Random Forest Prediction