## Tuples practice

From R Tidy Tuesday: https://github.com/rfordatascience/tidytuesday

In [1]:
import csv

In [2]:
filepath = "passwords.csv"

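# newline="" is what the csv module documentation recommends when opening files for csv.reader
with open(filepath, newline="") as f:
    data = [tuple(line) for line in csv.reader(f)]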

The first item in our list `data` holds the column names from our CSV.

In [3]:
columns = data[0] # tuple with column names
print(type(columns))
print(len(columns))
print(columns)

<class 'tuple'>
9
('rank', 'password', 'category', 'value', 'time_unit', 'offline_crack_sec', 'rank_alt', 'strength', 'font_size')


In [4]:
# delete the first record with column names for ease of working with the data
del data[0] 
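
As an alternative to `del data[0]`, the header can be separated from the rows in a single step with star unpacking. This is only a sketch: `sample` is a small hypothetical list standing in for the real data, and `header` and `rows` are names chosen for illustration.

```python
# Hypothetical sample standing in for the list built from the CSV
sample = [
    ("rank", "password", "category"),
    ("1", "password", "password-related"),
    ("2", "123456", "simple-alphanumeric"),
]

# The first tuple goes to `header`; the starred name collects the rest as a list
header, *rows = sample

print(header)
print(len(rows), "data rows")
```

Unlike `del`, this leaves the original list untouched, so the cell can be re-run without dropping another row.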

In [5]:
data[0] # first row of the dataset

('1', 'password', 'password-related', '6.91', 'years', '2.17', '1', '8', '11')
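
Indexing by position (`row[1]`, `row[3]`) is easy to get wrong with nine columns. One possible way to keep tuples but access fields by name is `collections.namedtuple`. The sketch below copies the column names and first row from the outputs above; `Password` is a hypothetical class name, not something defined elsewhere in this notebook.

```python
from collections import namedtuple

# Column names and first row copied from the outputs above
columns = ('rank', 'password', 'category', 'value', 'time_unit',
           'offline_crack_sec', 'rank_alt', 'strength', 'font_size')
first_row = ('1', 'password', 'password-related', '6.91', 'years', '2.17', '1', '8', '11')

# `Password` is a hypothetical class name chosen for this sketch
Password = namedtuple("Password", columns)
record = Password(*first_row)

print(record.password)   # same value as first_row[1]
print(record.strength)   # same value as first_row[7]
```

A named tuple is still a tuple, so `record[1]` and unpacking keep working as before.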

### 1. Inspect the data
Examine the fields and what they mean, as we did with the Harry Potter dataset.

Details on the dataset are here: https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-14/readme.md

In [6]:
print(len(data), "rows")
print(len(data[0]), "columns")

print("\nExample row of the dataset: \n\n", data[0])

507 rows
9 columns

Example row of the dataset: 

 ('1', 'password', 'password-related', '6.91', 'years', '2.17', '1', '8', '11')


In [7]:
print("the most common password is: ", data[0][1])
print("the passwords are categorized, for example the password 'password' is the category:", data[0][2])
print(f"""passwords have a value field, which is a numerical time to crack such as: {data[0][3]}, as well as a time unit such as: {data[0][4]} """)

the most common password is:  password
the passwords are categorized, for example the password 'password' is the category: password-related
passwords have a value field, which is a numerical time to crack such as: 6.91, as well as a time unit such as: years 


### 2. What percent of passwords have a number in them?

Let's test a single example before we iterate through all the data. Here we have a test password with two digits in it. The print statements are there to make each step of the loop visible.

### Version 0 (testing)

We start with a small example whose answer we already know.

In [8]:
# Create a test string password to see if our logic works
password = "5password1"

nums = "1234567890" # string type variable with all possible numbers/ digits
print(f"nums is variable of type {type(nums)} of length {len(nums)}")

num_digits = 0 # count how many digits are in this single password
for num in nums: # Iterate over all the string characters (digits) in the nums variable for this password
    print("num:",num)
    if num in password:
        num_digits += 1
        print(f"    the number {num} is in the password!")

print("")
print("the for loop has finished")
print(f"the password we iterated through {password} contained {num_digits} digits")

nums is variable of type <class 'str'> of length 10
num: 1
    the number 1 is in the password!
num: 2
num: 3
num: 4
num: 5
    the number 5 is in the password!
num: 6
num: 7
num: 8
num: 9
num: 0

the for loop has finished
the password we iterated through 5password1 contained 2 digits
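
Note that the loop above counts how many of the ten digits appear at least once, not how many characters of the password are digits. The two differ when a digit repeats. A small sketch with a hypothetical test password shows the difference:

```python
password = "5password11"  # hypothetical test password with a repeated digit

# Approach used above: how many of the ten digits appear at least once
distinct_digits = sum(d in password for d in "1234567890")

# Alternative: how many characters of the password are digits
digit_chars = sum(ch.isdigit() for ch in password)

print(distinct_digits)  # 2 (the digits 5 and 1)
print(digit_chars)      # 3 (the characters 5, 1, 1)
```

For the question "does this password contain a number?" both counts work, since either one is greater than zero exactly when a digit is present.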


### Version 1 (Lots of looping)

Use the same looping logic as in our test example, but this time on all passwords instead of just one.

In [9]:
nums = "1234567890" # string type variable with all possible numbers/ digits

digits_in_passwords = [] # list with the number of distinct digits present in each password

for row in data:
    password = row[1]
    num_digits = 0 # count how many distinct digits are in this single password
    for num in nums: # iterate over all the string characters (digits) in the nums variable for this password
        if num in password:
            num_digits += 1 # increment the counter when this digit appears in the password
    digits_in_passwords.append(num_digits)

In [10]:
# our list digits_in_passwords should have the same length as our data
print(len(digits_in_passwords), "items in the list")
print(len(digits_in_passwords) == len(data))

507 items in the list
True


In [11]:
avg_digits = sum(digits_in_passwords) / len(digits_in_passwords)
pass_with_digits = [p for p in digits_in_passwords if p > 0]
# count the passwords that have at least one digit (len), not the digits themselves (sum)
perc_passwords_with_digits = len(pass_with_digits) / len(digits_in_passwords)
print(f"Average of {avg_digits} distinct digits per password")
print(f"{len(pass_with_digits)} passwords with digits, {round(perc_passwords_with_digits*100, 2)} percent of passwords in dataset")

Average of 0.2879684418145957 distinct digits per password
54 passwords with digits, 10.65 percent of passwords in dataset


### Version 2 (complicated)

In [12]:
numerical_digits = "1234567890"
password_example = "1password5"

# we could use a list comprehension to get a list of True/False values, one per digit, showing whether that digit appears in the password
character_digit_booleans = [substring in password_example for substring in numerical_digits]
print(character_digit_booleans)
# then use the any() function to see if any value is True
any([substring in password_example for substring in numerical_digits])

[True, False, False, False, True, False, False, False, False, False]


True

In [13]:
numerical_digits = "1234567890" # define numerical digits we're searching for

count_passwords_with_num = 0

for row in data: # loop through all rows in the dataset
    
    # make a variable for the second element of the tuple which is the string text of the password
    password = row[1]
    # check whether each numerical digit appears in the password (list of 10 Boolean values, one per digit)
    password_digits = [substring_char in password for substring_char in numerical_digits]
    # check if any values are true or not
    if any(password_digits):
        # if a password has a number add it to the count
        count_passwords_with_num += 1
    

In [14]:
perc_passwords_with_digits = count_passwords_with_num / len(data)

print(f"There are {count_passwords_with_num} passwords with numbers")
print(f"{round(perc_passwords_with_digits*100, 2)}% of passwords in the dataset have numbers")

There are 54 passwords with numbers
10.65% of passwords in the dataset have numbers
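
The same idea can be written more compactly by passing a generator expression straight to `any()`, which stops at the first match instead of building a full list. This is a sketch only: `sample_passwords` is a hypothetical list rather than the dataset, and `has_digit` is a helper name introduced here for illustration.

```python
import string

# Hypothetical sample passwords, not taken from the dataset
sample_passwords = ["password", "123456", "abc123", "qwerty"]

def has_digit(password):
    """Return True if at least one character of the password is a digit 0-9."""
    return any(ch in string.digits for ch in password)

# True counts as 1 and False as 0, so sum() gives the number of matches
count = sum(has_digit(p) for p in sample_passwords)
percent = round(count / len(sample_passwords) * 100, 2)

print(f"{count} of {len(sample_passwords)} sample passwords contain a digit ({percent}%)")
```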


### 3. What percent of passwords have a special character in them? 


Optionally, you may find the `string` module helpful. Note: this is different from the operations we performed on strings in the past. The documentation is helpful here, particularly the section on the ASCII constants such as `string.punctuation`. Read more in the documentation here: https://docs.python.org/3/library/string.html#string.punctuation

In [15]:
import string # import the string module
string.digits # look up the digits constant from the string module (a plain string, not a method)

'0123456789'
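
`string.punctuation` works the same way as `string.digits`: it is an ordinary string constant, so membership tests and comprehensions apply directly. A brief sketch, using a hypothetical password that does contain special characters:

```python
import string

# string.punctuation is a plain str constant, so the usual string tools apply
print(type(string.punctuation), len(string.punctuation))

sample = "p@ssw0rd!"  # hypothetical password containing special characters

# Collect the special characters that actually occur in the sample
found = [ch for ch in sample if ch in string.punctuation]
print(found)
```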

In [16]:
punc = string.punctuation

# list of passwords only
passwords = [row[1] for row in data]

# create an empty list to hold the number of distinct special characters in each password
special_chars_per_password = []

for password in passwords:
    # one Boolean per punctuation character: does it appear in the password?
    password_special_chars = [substring_char in password for substring_char in punc]
    special_chars_per_password.append(sum(password_special_chars))

In [17]:
passwords_with_special_chars = [p for p in special_chars_per_password if p > 0]
count_passwords_with_special_chars = len(passwords_with_special_chars)
perc_passwords_with_special_chars = count_passwords_with_special_chars / len(data)

print(f"There are {count_passwords_with_special_chars} passwords with special characters")
print(f"{round(perc_passwords_with_special_chars*100, 2)}% of passwords in the dataset have special characters")

There are 0 passwords with special characters
0.0% of passwords in the dataset have special characters


In [18]:
# Problem? Nope. No passwords have any special characters!! 

# Oops... Nick didn't look at the data when creating this question, and made bad assumptions about the dataset 

### 4. Can you add an attribute to the tuple for number of special characters in the password?

In [19]:
# we can't add elements to an existing tuple (tuples are immutable and can't be changed!)

# so instead we make a new list of new tuples for our data
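
To make the immutability concrete, here is a short sketch using a shortened copy of the first row. Item assignment raises a `TypeError`, while concatenation with a one-element tuple builds a new tuple and leaves the original alone. The name `extended` is chosen only for this illustration.

```python
row = ('1', 'password', 'password-related')  # shortened copy of the first row

try:
    row[0] = '2'  # item assignment is not allowed on tuples
except TypeError as err:
    print("TypeError:", err)

# Concatenation builds a brand-new tuple; note the trailing comma in (0,)
extended = row + (0,)

print(extended)
print(row)  # the original tuple is unchanged
```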

In [20]:
new_data = [] # new list to hold our tuples of data

for row in data:
    password = row[1]

    # one Boolean per punctuation character: does it appear in the password?
    password_special_chars = [substring_char in password for substring_char in string.punctuation]
    num_special_chars = sum(password_special_chars) # our new attribute for the tuple

    new_row = row + (num_special_chars,) # make a new flat tuple: all our old data plus our new element
    new_data.append(new_row) # append our new tuple to a list

In [21]:
print("Example of a row", new_data[0])

Example of a row ('1', 'password', 'password-related', '6.91', 'years', '2.17', '1', '8', '11', 0)


### 5. Is there a correlation between password length and strength? Between number of special characters and strength?

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html


In [22]:
from scipy.stats import pearsonr

In [23]:
# subsetting the data to remove some pesky null values at the end of the dataset
# this is risky to drop without knowing why, but doing it for testing
data = data[:-7]
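
A less position-dependent option is to filter out incomplete rows by inspecting their fields. The sketch below is built on assumptions: the rows are hypothetical, and treating `""` and `"NA"` as missing-value markers is a guess about how nulls appear in the file, so check the raw CSV before relying on it.

```python
# Hypothetical rows: (rank, password, value); the last two mimic missing records
sample = [
    ("1", "password", "6.91"),
    ("2", "123456", "18.52"),
    ("NA", "NA", "NA"),
    ("", "", ""),
]

MISSING = {"", "NA"}  # assumed markers for a missing value

def is_complete(row):
    """Return True when no field in the row is a missing-value marker."""
    return all(field not in MISSING for field in row)

clean = [row for row in sample if is_complete(row)]
dropped = len(sample) - len(clean)

print(f"kept {len(clean)} rows, dropped {dropped}")
```

Printing the number of dropped rows gives a quick check that the filter removes only what was expected.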

In [24]:
# let's make two lists that subset just the columns we're interested in
# in the list comprehensions we iterate through our list of tuples and select the element we care about
# we also call the float() function to change the type, or the len() function to get the length of the string
# note: index 3 is the 'value' column; the 'strength' column itself sits at index 7
strength = [float(row[3]) for row in data]
length = [len(row[1]) for row in data]

In [25]:
result = pearsonr(strength, length)
print(type(result))
print(result) # our result is a tuple!

<class 'tuple'>
(0.08344090882053395, 0.062268658837267885)
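
The first element of that tuple is the Pearson correlation coefficient. To show what that number measures, here is a minimal pure-Python sketch of the standard formula on hypothetical toy data. It is not how `scipy` implements it, and it computes only the coefficient, not the p-value.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]  # perfectly linear in xs
zs = [5, 4, 3, 2, 1]   # perfectly reversed

print(pearson(xs, ys))
print(pearson(xs, zs))
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0, like the one reported above, indicate a weak one.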


In [26]:
# we can unpack this tuple to assign the result to variables in one step
corr, p_val = pearsonr(strength, length)

In [27]:
# Analyze our result
print("corr", corr)
print("p_val", p_val)
if p_val <= 0.05:
    print("We reject the null hypothesis!")
else:
    print("We fail to reject the null hypothesis that length and strength are uncorrelated")

corr 0.08344090882053395
p_val 0.062268658837267885
We fail to reject the null hypothesis that length and strength are uncorrelated
