# Statistics of the lengths

We want to check that the transformations we have applied so far are sane, so that we can work with a cleaned-up dataset.

In [1]:
import pandas as pd

df = pd.read_json("../data/processed/data.json")
df.head()

Unnamed: 0,Description,Description_Length,Label,Name,Procedures,Procedures_Description_Ratio,Procedures_Length
0,SCP-1256 is a 24-page pamphlet entitled 'Bees ...,1837,SAFE,Item #: SCP-1256,Mobile Task Force Zeta-4 ('Beekeepers') is cur...,0.224279,412
1,SCP-2987 is a modified MSI brand external hard...,2187,SAFE,Item #: SCP-2987,SCP-2987 is to be kept on floor 17 of Site-88....,0.203475,445
2,SCP-2039 collectively refers to two distinct f...,5399,EUCLID,Item #: SCP-2039,"Presently, Foundation efforts at Research Faci...",0.368772,1991
3,SCP-1530 is a two-story abandoned house locate...,3893,EUCLID,Item #: SCP-1530,SCP-1530 is currently contained 120 meters fro...,0.201387,784
4,SCP-1524 is the sole remaining specimen of a s...,3211,EUCLID,Item #: SCP-1524,Both of SCP-1524's individual components are t...,0.530364,1703
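
One sanity check suggested by the rows above: the ratio column looks like `Procedures_Length` divided by `Description_Length` (for example, 412 / 1837 ≈ 0.224279). That relationship is inferred from the displayed values and is an assumption, since the processing code is not shown here. A minimal sketch in plain Python, using only the five rows printed above:

```python
import math

# (Description_Length, Procedures_Length, Procedures_Description_Ratio),
# copied from the five rows displayed by df.head() above.
rows = [
    (1837, 412, 0.224279),
    (2187, 445, 0.203475),
    (5399, 1991, 0.368772),
    (3893, 784, 0.201387),
    (3211, 1703, 0.530364),
]


def ratio_is_consistent(description_length, procedures_length, ratio, tol=1e-6):
    """Check the assumed relationship ratio == procedures / description.

    The tolerance accounts for the ratio being displayed with six decimals.
    """
    return math.isclose(procedures_length / description_length, ratio, abs_tol=tol)


all_consistent = all(ratio_is_consistent(*row) for row in rows)
print(all_consistent)
```

On the full DataFrame, the same comparison could be done column-wise, again assuming the ratio was defined this way.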


Let's look at some statistics of the extracted text lengths and the ratio.

In [2]:
df.describe()

Unnamed: 0,Description_Length,Procedures_Description_Ratio,Procedures_Length
count,2700.0,2700.0,2700.0
mean,3208.542222,0.28684,777.595556
std,1658.345674,0.293568,519.808074
min,61.0,0.0,0.0
25%,2104.75,0.145726,414.75
50%,2887.0,0.229935,656.5
75%,3957.0,0.353646,994.25
max,31618.0,7.377049,7922.0


*count*, *mean*, *min* and *max* are self-explanatory; *std* stands for
*standard deviation*. The rows with percentages are the 25%-, 50%- and
75%-*quantiles*, respectively. They were defined in [my blog post on means and
medians](https://paul-grillenberger.de/?p=21). Here's a short refresher: the 25%-quantile is a value such that at least 25%
of the data is smaller than or equal to it and at least 75% of the data is
greater than or equal to it. The 50%-quantile is also known as the median.
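
To make the refresher concrete, here is a small sketch that computes the three quartiles of the five `Procedures_Length` values shown in the first table and checks the defining property of the 25%-quantile. It uses Python's `statistics` module rather than pandas; the `"inclusive"` method is chosen because, as far as I know, it interpolates the same way as the pandas default, but treat that equivalence as an assumption.

```python
import statistics

# Procedures_Length of the five rows displayed by df.head() above.
lengths = [412, 445, 1991, 784, 1703]

# n=4 yields the three cut points: the 25%-, 50%- and 75%-quantiles.
q25, q50, q75 = statistics.quantiles(lengths, n=4, method="inclusive")

# The 50%-quantile coincides with the median.
median = statistics.median(lengths)

# Defining property of the 25%-quantile: at least 25% of the values are
# less than or equal to it, and at least 75% are greater than or equal to it.
share_below = sum(value <= q25 for value in lengths) / len(lengths)
share_above = sum(value >= q25 for value in lengths) / len(lengths)

print(q25, q50, q75, median, share_below, share_above)
```

With only five values the shares cannot be exactly 25% and 75%, which is why the definition is usually phrased with "at least".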

The minimum of 61 characters in Description_Length looks reasonable, but a
Containment Procedure with 0 characters? This has to be investigated. Before we
do so, let us look at the same statistics grouped by label.

In [3]:
df.groupby("Label").describe().stack()

Label,,Description_Length,Procedures_Description_Ratio,Procedures_Length
EUCLID,count,1274.0,1274.0,1274.0
EUCLID,mean,3244.361852,0.308139,855.422292
EUCLID,std,1701.660229,0.273383,529.89666
EUCLID,min,428.0,0.011165,148.0
EUCLID,25%,2179.25,0.169438,497.25
EUCLID,50%,2935.5,0.259065,727.0
EUCLID,75%,3977.75,0.371186,1075.75
EUCLID,max,31618.0,6.051948,7922.0
KETER,count,314.0,314.0,314.0
KETER,mean,3380.487261,0.401208,1128.343949


This is where it starts to get interesting! Since *safe* SCPs are much easier to contain than *euclid* ones, which in turn are easier to contain than *keter* SCPs, we expect the Containment Procedures to be short for safe SCPs and more elaborate for keter ones. On average, this is reflected in the mean length of Containment Procedures, which rises from safe (just under 600 characters) through euclid (about 855) to keter (about 1128).

Let us turn to the problematic cases of zero lengths.

In [4]:
df.loc[(df["Procedures_Length"] == 0) | (df["Description_Length"] == 0)]

Unnamed: 0,Description,Description_Length,Label,Name,Procedures,Procedures_Description_Ratio,Procedures_Length
1340,SCP-1994 is the general designation for a set ...,1376,KETER,Item #: SCP-1994,,0.0,0


Thankfully, this is a single outlier. Inspecting the HTML of the article on the SCP Foundation web page shows that the label "Special Containment Procedures" sits in its own `p` element, so we were not able to crawl this article correctly.

Let us ignore the outlier.

In [5]:
df = df.loc[df["Procedures_Length"] > 0]

Finally, let us compute correlations between our features and the target. The correlation coefficient can only be computed for numeric random variables. Thankfully, the *nominal* labels safe, euclid and keter carry *ordinal* information: we can order them by their *containment complexity*.
To make this explicit, let us assign numbers to the three labels. A safe label is converted to -1, a euclid label to 0 and a keter label to 1, so that the order of containment complexity is reflected by $\mathrm{safe} < \mathrm{euclid} < \mathrm{keter}$. However, the magnitude of this conversion is still open for discussion: we could just as well have chosen $10^{100}$ for keter, and this would have influenced the correlation coefficients. But let's stick to our simple conversion for now.

In [6]:
COMPLEXITY = {
    "SAFE": -1,
    "EUCLID": 0,
    "KETER": 1
}

def compute_complexity(label):
    return COMPLEXITY[label]

df["Complexity"] = df["Label"].apply(compute_complexity)
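# Restrict to numeric columns; pandas 2.0 and later raise on text columns otherwise.
df.corr(numeric_only=True)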

Unnamed: 0,Description_Length,Procedures_Description_Ratio,Procedures_Length,Complexity
Description_Length,1.0,-0.293831,0.220675,0.052532
Procedures_Description_Ratio,-0.293831,1.0,0.577548,0.188953
Procedures_Length,0.220675,0.577548,1.0,0.344329
Complexity,0.052532,0.188953,0.344329,1.0


As it turns out, Complexity and Procedures_Length are positively correlated (about 0.34), which is precisely what we observed in the statistics grouped by label. We also see that Description_Length is only very weakly correlated with Complexity (about 0.05): there is no reason why, say, a safe SCP should not have a long description or why a keter SCP could not be described briefly.
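
The earlier remark that the choice of numbers for the labels influences the correlation coefficient can be checked on a toy example. The sketch below uses made-up lengths and labels (not taken from the dataset) and a hand-written Pearson coefficient in plain Python. It compares the encoding used here with a hypothetical one that keeps the same order but maps keter to 100.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up procedure lengths and labels, for illustration only.
lengths = [300, 400, 500, 700, 800, 900, 1000, 1500, 2000]
labels = ["SAFE"] * 3 + ["EUCLID"] * 3 + ["KETER"] * 3

simple = {"SAFE": -1, "EUCLID": 0, "KETER": 1}
# Hypothetical alternative: same order, very different spacing.
stretched = {"SAFE": -1, "EUCLID": 0, "KETER": 100}

r_simple = pearson(lengths, [simple[label] for label in labels])
r_stretched = pearson(lengths, [stretched[label] for label in labels])
print(round(r_simple, 3), round(r_stretched, 3))
```

The two coefficients differ even though the order of the labels is unchanged. A rank-based coefficient such as Spearman's depends only on that order, so it would be unaffected by the re-spacing; in pandas it is available through `df.corr(method="spearman")`.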