## Introduction
<p><img src="https://i.imgur.com/kjWF1So.jpg" alt="Different characters on a computer screen"></p>
<p>According to a 2019 <a href="https://storage.googleapis.com/gweb-uniblog-publish-prod/documents/PasswordCheckup-HarrisPoll-InfographicFINAL.pdf">Google / Harris Poll</a>, 24% of Americans have used common passwords, like <code>abc123</code>, <code>Password</code>, and <code>Admin</code>. Even more concerning, 59% of Americans have incorporated personal information, such as their name or birthday, into their password. This makes it unsurprising that 4 in 10 Americans have had their personal information compromised online. Passwords built from commonly used phrases and personal information make cracking a password drastically easier.</p>
<p>You may have noticed over the years that password requirements have increased in complexity, including recommendations to change your passwords every couple of months. Compiled from industry recommendations, below is a list of password requirements you will be asked to test:</p>
<p><strong>Password Requirements:</strong></p>
<ol>
<li>Must be at least 10 characters in length</li>
<li>Must contain at least:<ul>
<li>one lower case letter </li>
<li>one upper case letter </li>
<li>one numeric character </li>
<li>one non-alphanumeric character</li></ul></li>
<li>Must not contain the phrase <code>password</code> (case insensitive)</li>
<li>Must not contain the user's first or last name, e.g., if the user's name is <code>John Smith</code>, then <code>SmItH876!</code> is not a valid password.</li>
</ol>
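<p>As a rough illustration, the four requirements above can be sketched as a single check on one password. This is a minimal sketch rather than the project's reference solution: the function name <code>is_valid_password</code> and its parameters are hypothetical, and "non-alphanumeric" is assumed here to mean any character that is neither an ASCII letter nor a digit.</p>

```python
import re


def is_valid_password(password: str, first_name: str, last_name: str) -> bool:
    """Hypothetical helper: True if `password` meets all four requirements."""
    lowered = password.lower()
    checks = [
        len(password) >= 10,                               # 1: at least 10 characters
        re.search(r"[a-z]", password) is not None,         # 2: one lower case letter
        re.search(r"[A-Z]", password) is not None,         # 2: one upper case letter
        re.search(r"[0-9]", password) is not None,         # 2: one numeric character
        re.search(r"[^a-zA-Z0-9]", password) is not None,  # 2: one non-alphanumeric character (assumed definition)
        "password" not in lowered,                         # 3: banned phrase, case insensitive
        first_name.lower() not in lowered,                 # 4: no first name
        last_name.lower() not in lowered,                  # 4: no last name
    ]
    return all(checks)


print(is_valid_password("SmItH876!", "John", "Smith"))                   # False
print(is_valid_password("Mail_Pen%Scarlets.414", "consuelo", "eaton"))   # True
```
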
<p>Here is the dataset that you will investigate in this project:</p>
<div style="background-color: #ebf4f7; color: #595959; text-align:left; vertical-align: middle; padding: 15px 25px 15px 25px; line-height: 1.6;">
    <div style="font-size:20px"><b>datasets/logins.csv</b></div>
Each row represents a login credential. There are no missing values and you can consider the dataset "clean".
<ul>
    <li><b>id:</b> the user's unique ID.</li>
    <li><b>username:</b> the username with the format {firstname}.{lastname}.</li>
    <li><b>password:</b> the password that may or may not meet the requirements. <i>Note, passwords should never be saved in plaintext, always encrypt them when working with real live passwords!</i></li>
</ul>
</div>
<p>Warning: This dataset contains some <strong>real</strong> passwords leaked from <strong>real</strong> websites. These passwords have been filtered, but may still include words that are explicit and offensive.</p>
<p>From here on out, it will be your task to explore and manipulate the existing data until you can answer the two questions described in the instructions panel. Feel free to import as many packages as you need to complete your task, and add cells as necessary. Finally, remember that you are only tested on your answer, not on the methods you use to arrive at the answer!</p>
<p><strong>Note:</strong> To complete this project, you need to know how to manipulate strings in pandas DataFrames and be familiar with regular expressions. Before starting this project we recommend that you have completed the following courses: <a href="https://learn.datacamp.com/courses/data-cleaning-in-python">Data Cleaning in Python</a> and <a href="https://learn.datacamp.com/courses/regular-expressions-in-python">Regular Expressions in Python</a>.</p>

## Instructions 

Your two questions are as follows:

1. **What percentage of users have invalid passwords?**
    - Save your answer as a variable, 'bad_pass', in the form of a float rounded up to two decimals (e.g., 0.18)
2. **Which users need to change their passwords?**
    - Save your answer as a pandas Series consisting of the 'usernames' *in alphabetically ascending order* called 'email_list'. This will be used to automate email notifications to employees.

In [1]:
#Preliminaries
import pandas as pd

In [6]:
# Create dataframe from 'logins.csv' and evaluate data
logins = pd.read_csv("datasets/logins.csv")
logins.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 982 entries, 0 to 981
Data columns (total 3 columns):
id          982 non-null int64
username    982 non-null object
password    982 non-null object
dtypes: int64(1), object(2)
memory usage: 23.1+ KB


In [7]:
logins.head(10)

Unnamed: 0,id,username,password
0,1,vance.jennings,vanceRules888!
1,2,consuelo.eaton,Mail_Pen%Scarlets.414
2,3,mitchel.perkins,Z00+1960
3,4,odessa.vaughan,D-rockyou
4,5,araceli.wilder,Araceli}r3
5,6,shawn.harrington,126_239_123
6,7,evelyn.gay,`4:&iAt$'o~(
7,8,noreen.hale,25941829163
8,9,gladys.ward,=Wj1`i)xYYZ
9,10,brant.zimmerman,L?4)OSB$r


In [8]:
# Use boolean indexing to separate valid vs invalid passwords into two dataframes 

# Password requirement 1: Must be at least 10 characters 
    # Store in 'length_check' a boolean Series that is True where the password has at least 10 characters
length_check = logins['password'].str.len() >= 10
length_check.head(10)

0     True
1     True
2    False
3    False
4     True
5     True
6     True
7     True
8     True
9    False
Name: password, dtype: bool

In [10]:
# Create a new dataframe that holds only accounts with valid passwords based on Requirement #1
valid_pws = logins[length_check]
valid_pws.head(10)

Unnamed: 0,id,username,password
0,1,vance.jennings,vanceRules888!
1,2,consuelo.eaton,Mail_Pen%Scarlets.414
4,5,araceli.wilder,Araceli}r3
5,6,shawn.harrington,126_239_123
6,7,evelyn.gay,`4:&iAt$'o~(
7,8,noreen.hale,25941829163
8,9,gladys.ward,=Wj1`i)xYYZ
10,11,leanna.abbott,"@_2.#,%~>~&+"
11,12,milford.hubbard,Milford<3Tom
12,13,mamie.fox,chichi821?


In [11]:
# Create a dataframe that holds only accounts with invalid passwords based on Requirement #1
bad_pws = logins[~length_check]
bad_pws.head(10)

Unnamed: 0,id,username,password
2,3,mitchel.perkins,Z00+1960
3,4,odessa.vaughan,D-rockyou
9,10,brant.zimmerman,L?4)OSB$r
16,17,domingo.dyer,VeOw{*p
17,18,martin.pacheco,MP1985???
18,19,shelby.massey,787175
19,20,rosella.barrett,TOBBY05
25,26,dianna.munoz,munoZ_001
26,27,julia.savage,"z5dm_c""R\"
27,28,loretta.bass,%%%bass


In [12]:
# Confirm that all accounts are in either of the two dataframes by summing the rows. The total sum must be 982
bad_pws.shape[0], valid_pws.shape[0]

(422, 560)
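The row-count check above can also be written as explicit assertions, so that a mistake in the split fails loudly instead of relying on reading the tuple. The sketch below uses made-up toy passwords in place of `datasets/logins.csv`; the names `toy`, `valid_toy`, and `bad_toy` are hypothetical.

```python
import pandas as pd

# Toy data standing in for datasets/logins.csv (values are made up for illustration)
toy = pd.DataFrame({"password": ["short", "longenough123", "tiny", "alsolongenough"]})

length_ok = toy["password"].str.len() >= 10
valid_toy = toy[length_ok]
bad_toy = toy[~length_ok]

# Every row must land in exactly one of the two frames
assert len(valid_toy) + len(bad_toy) == len(toy)
assert valid_toy.index.intersection(bad_toy.index).empty

print(len(bad_toy), len(valid_toy))  # 2 2
```
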

In [14]:
# Password Requirement 2: At least one lower case, one upper case, one numeric, and one non-alphanumeric character
    # Create 4 boolean variables for the 4 requirements 
lowercase = valid_pws['password'].str.contains(r'[a-z]')
uppercase = valid_pws['password'].str.contains(r'[A-Z]')
numeric = valid_pws['password'].str.contains(r'\d') # \d matches any digit; for ASCII text it is equivalent to [0-9]
special = valid_pws['password'].str.contains(r'\W') # \W matches any non-word character; the underscore counts as a word character, so it is not matched

# Add the boolean variables as columns to 'valid_pws' Dataframe
pd.concat([valid_pws, lowercase, uppercase, numeric, special], axis=1)

Unnamed: 0,id,username,password,password.1,password.2,password.3,password.4
0,1,vance.jennings,vanceRules888!,True,True,True,True
1,2,consuelo.eaton,Mail_Pen%Scarlets.414,True,True,True,True
4,5,araceli.wilder,Araceli}r3,True,True,True,True
5,6,shawn.harrington,126_239_123,False,False,True,False
6,7,evelyn.gay,`4:&iAt$'o~(,True,True,True,True
7,8,noreen.hale,25941829163,False,False,True,False
8,9,gladys.ward,=Wj1`i)xYYZ,True,True,True,True
10,11,leanna.abbott,"@_2.#,%~>~&+",False,False,True,True
11,12,milford.hubbard,Milford<3Tom,True,True,True,True
12,13,mamie.fox,chichi821?,True,False,True,True
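The concatenated output above repeats the column name `password` for every check, which makes the table hard to read. One possible way around this, sketched below on three passwords taken from the table, is to collect the checks in a DataFrame with descriptive column names. The column names are hypothetical. The last column is an assumption-driven variant that also counts the underscore as a special character, which `\W` alone does not; whether the project's expected answer treats the underscore that way is not stated in the source.

```python
import pandas as pd

toy = pd.DataFrame({"password": ["126_239_123", "chichi821?", "Araceli}r3"]})

checks = pd.DataFrame({
    "has_lower": toy["password"].str.contains(r"[a-z]"),
    "has_upper": toy["password"].str.contains(r"[A-Z]"),
    "has_digit": toy["password"].str.contains(r"\d"),
    "has_special": toy["password"].str.contains(r"\W"),  # underscore is not matched
    "has_special_or_underscore": toy["password"].str.contains(r"[\W_]"),  # assumed variant
})

print(pd.concat([toy, checks], axis=1))
```
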


In [15]:
# All boolean variables have to be true in order for requirement to be met
char_check = lowercase & uppercase & numeric & special 

# Add the failing rows to bad_pws before filtering valid_pws; once valid_pws is filtered, those rows are no longer available
bad_pws = pd.concat([bad_pws, valid_pws[~char_check]], ignore_index=True) # DataFrame.append was removed in pandas 2.0
valid_pws = valid_pws[char_check]

#Confirm that all rows are accounted for in the two dataframes; total sum must be 982
bad_pws.shape[0], valid_pws.shape[0]

(724, 258)

In [16]:
# Password Requirement 3: Must not contain the phrase password (case insensitive)

banned_phrases = valid_pws['password'].str.contains('password', case=False)

bad_pws = pd.concat([bad_pws, valid_pws[banned_phrases]], ignore_index=True) # DataFrame.append was removed in pandas 2.0
valid_pws = valid_pws[~banned_phrases]

#Confirm that all rows are accounted for in the two dataframes; total sum must be 982
bad_pws.shape[0], valid_pws.shape[0]

(725, 257)
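To make the effect of `case=False` concrete, here is a small sketch on made-up passwords. Passing `regex=False` is an optional choice (not used in the cell above) that treats the search term as a literal substring. Note that a leetspeak spelling such as `Pa$$w0rd` is not caught, because the requirement only bans the literal phrase.

```python
import pandas as pd

# Made-up passwords for illustration
toy = pd.Series(["MyPassword#2024", "PASSWORD!!aa11", "Pa$$w0rd_Is_Fine1", "Wanderer.849+Enlarges"])

# case=False ignores letter case; regex=False treats "password" as a literal substring
has_banned = toy.str.contains("password", case=False, regex=False)

print(has_banned.tolist())  # [True, True, False, False]
```
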

In [17]:
# Password Requirement 4: Must not contain the user's first or last name 

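valid_pws = valid_pws.copy() # work on an explicit copy so that adding columns does not trigger SettingWithCopyWarning
valid_pws['first_name'] = valid_pws['username'].str.extract(r'(^[a-z]+)', expand = False) #regex to pull the first name
valid_pws['last_name'] = valid_pws['username'].str.extract(r'([a-z]+$)', expand = False) #regex to pull the last name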

valid_pws.head(10)

Unnamed: 0,id,username,password,first_name,last_name
0,1,vance.jennings,vanceRules888!,vance,jennings
1,2,consuelo.eaton,Mail_Pen%Scarlets.414,consuelo,eaton
4,5,araceli.wilder,Araceli}r3,araceli,wilder
6,7,evelyn.gay,`4:&iAt$'o~(,evelyn,gay
8,9,gladys.ward,=Wj1`i)xYYZ,gladys,ward
11,12,milford.hubbard,Milford<3Tom,milford,hubbard
13,14,jamie.cochran,Deviants.Assists.Impede+24,jamie,cochran
15,16,lorrie.gay,Q0G:[@u9*_`_,lorrie,gay
21,22,leticia.sanford,Parole:Seagull+Cession-148,leticia,sanford
23,24,brandie.webster,321.Snuffs-Pinball.Nougat,brandie,webster


In [18]:
# Iterate over each row and check whether the password contains either the first name or the last name 

for i, row in valid_pws.iterrows():
    password_lower = row.password.lower() # lowercase the password so the comparison is case insensitive
    if row.first_name in password_lower or row.last_name in password_lower:
        print(row)
        valid_pws = valid_pws.drop(index=i)
        bad_pws = pd.concat([bad_pws, row.to_frame().T], ignore_index=True) # DataFrame.append was removed in pandas 2.0

id                         1
username      vance.jennings
password      vanceRules888!
first_name             vance
last_name           jennings
Name: 0, dtype: object
id                         5
username      araceli.wilder
password          Araceli}r3
first_name           araceli
last_name             wilder
Name: 4, dtype: object
id                         12
username      milford.hubbard
password         Milford<3Tom
first_name            milford
last_name             hubbard
Name: 11, dtype: object
id                      668
username      simon.miranda
password         SimonR0ck$
first_name            simon
last_name           miranda
Name: 667, dtype: object
id                       750
username      irvin.martinez
password       bananaIrvin8)
first_name             irvin
last_name           martinez
Name: 749, dtype: object
id                      790
username          sean.leon
password      SeansPa$$w0rd
first_name             sean
last_name              leon
Name: 789, dtyp
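The row-by-row loop above can also be expressed as a single boolean mask, which avoids modifying the frames while iterating. This is a sketch of one alternative, not the approach used in the notebook; the toy rows reuse usernames and passwords shown earlier, and the name `contains_name` is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "username": ["vance.jennings", "consuelo.eaton", "dianna.munoz"],
    "password": ["vanceRules888!", "Mail_Pen%Scarlets.414", "munoZ_001"],
})
names = toy["username"].str.split(".", n=1, expand=True)
toy["first_name"] = names[0]
toy["last_name"] = names[1]

# Compare against the lowercased password so the check is case insensitive
lowered = toy["password"].str.lower()
contains_name = pd.Series(
    [first in pw or last in pw
     for pw, first, last in zip(lowered, toy["first_name"], toy["last_name"])],
    index=toy.index,
)

print(toy.loc[contains_name, "username"].tolist())  # ['vance.jennings', 'dianna.munoz']
```
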

In [19]:
valid_pws.head(10)

Unnamed: 0,id,username,password,first_name,last_name
1,2,consuelo.eaton,Mail_Pen%Scarlets.414,consuelo,eaton
6,7,evelyn.gay,`4:&iAt$'o~(,evelyn,gay
8,9,gladys.ward,=Wj1`i)xYYZ,gladys,ward
13,14,jamie.cochran,Deviants.Assists.Impede+24,jamie,cochran
15,16,lorrie.gay,Q0G:[@u9*_`_,lorrie,gay
21,22,leticia.sanford,Parole:Seagull+Cession-148,leticia,sanford
23,24,brandie.webster,321.Snuffs-Pinball.Nougat,brandie,webster
29,30,rene.small,"]9""mP(kM4c",rene,small
30,31,rosanna.reid,Outguess%Dresser:Derails=669,rosanna,reid
33,34,patrica.hicks,Wanderer.849+Enlarges:Olympia,patrica,hicks


In [20]:
#if 'first_name' and 'last_name' are NaN, it is because the rows were appended before we added that information
bad_pws.head(10)

Unnamed: 0,id,username,password,first_name,last_name
0,3,mitchel.perkins,Z00+1960,,
1,4,odessa.vaughan,D-rockyou,,
2,10,brant.zimmerman,L?4)OSB$r,,
3,17,domingo.dyer,VeOw{*p,,
4,18,martin.pacheco,MP1985???,,
5,19,shelby.massey,787175,,
6,20,rosella.barrett,TOBBY05,,
7,26,dianna.munoz,munoZ_001,,
8,27,julia.savage,"z5dm_c""R\",,
9,28,loretta.bass,%%%bass,,


In [21]:
# Create a 'bad_pass' variable to hold the percentage of accounts that do not have strong passwords 
bad_pass = round(bad_pws.shape[0]/logins.shape[0],2)
bad_pass

0.75
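If the invalid passwords are tracked as a boolean mask rather than as a separate DataFrame, the same share can be computed from the mask's mean. The sketch below uses a hypothetical four-element mask. The explicit `float(...)` conversion is there because pandas returns a NumPy scalar from `mean()`, and a strict `type(...) == float` check like the one in the test cell would likely reject it.

```python
import pandas as pd

# Hypothetical mask: True marks a password that failed at least one requirement
is_invalid = pd.Series([True, True, False, True])

share_by_count = is_invalid.sum() / len(is_invalid)
share_by_mean = is_invalid.mean()  # the mean of a boolean Series is the share of True values

# Convert to a built-in float before rounding
bad_share = round(float(share_by_mean), 2)
print(bad_share)  # 0.75
```
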

In [22]:
# Create an 'email_list' variable to hold the usernames of all accounts with bad passwords, sorted alphabetically
email_list = bad_pws['username'].sort_values()
print(email_list)

405           abdul.rowland
309            addie.cherry
372            adele.moreno
517            adeline.bush
279             adolfo.kane
337             adolfo.lara
16             ahmad.hopper
122              aida.combs
700           aisha.jenkins
199               al.dunlap
147            alana.franco
593         alberta.leblanc
521            alec.robbins
671    alejandra.stephenson
434         alejandro.burke
482        alejandro.nieves
205        alexander.thomas
400       alexandria.hinton
453       alexis.mccullough
93          alexis.reynolds
568          alfonso.weaver
151           alfonzo.johns
611          alisa.campbell
342             alisa.cohen
567             alison.neal
190          allan.marshall
142           alonzo.fowler
652           amado.bridges
88         amado.fitzgerald
592           amber.summers
               ...         
20              ursula.wood
280       valentin.castillo
596           valeria.curry
725          vance.jennings
319           vaness
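As the output above shows, `sort_values()` keeps the original index labels, so the index is no longer in order. If a clean 0-based index is wanted, one option is to chain `reset_index(drop=True)`, as sketched here on three usernames from the dataset; the variable names are hypothetical.

```python
import pandas as pd

toy = pd.Series(["vance.jennings", "abdul.rowland", "mitchel.perkins"], name="username")

# Sort alphabetically, then discard the old index labels
sorted_names = toy.sort_values().reset_index(drop=True)

print(sorted_names.tolist())        # ['abdul.rowland', 'mitchel.perkins', 'vance.jennings']
print(sorted_names.index.tolist())  # [0, 1, 2]
```
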

In [None]:
%%nose

def test_bad_pass():
    assert bad_pass not in (0.7, 0.8, 0.778, 1), \
        "Did you round up to two decimals?"
    assert type(bad_pass)==float, \
        "Did you save `bad_pass` as a float?"
    assert bad_pass != 0.22, \
        "Did you calculate the percentage of valid passwords instead of invalid passwords?"
    assert bad_pass != 0.77, \
        "Did you check for first and last names in the passwords? Remember, the password is invalid if it contains (not equals) either the first name, the last name, or the word `password`."
    assert bad_pass != 0.70, \
        "Did you check that each password contains at least one numeric character?"
    assert bad_pass != 0.64, \
        "Did you check that each password contains at least one alphanumeric character?"
    assert bad_pass == 0.75, \
        "Have you properly checked the passwords for all four requirements?"
    
def test_email_list():
    test = pd.read_csv("datasets/emaillist.csv")
    assert type(email_list) == pd.core.series.Series, \
        "Did you save the usernames in a pandas Series?"
    assert email_list.iloc[0]=="abdul.rowland", \
        "Did you sort the user names in alphabetically ascending order?"
    assert email_list.reset_index(drop=True).equals(test['username']), \
        "Have you properly checked the passwords for all four requirements?"