## Introduction
<p><img src="https://i.imgur.com/kjWF1So.jpg" alt="Different characters on a computer screen"></p>
<p>According to a 2019 <a href="https://storage.googleapis.com/gweb-uniblog-publish-prod/documents/PasswordCheckup-HarrisPoll-InfographicFINAL.pdf">Google / Harris Poll</a>, 24% of Americans have used common passwords, like <code>abc123</code>, <code>Password</code>, and <code>Admin</code>. Even more concerning, 59% of Americans have incorporated personal information, such as their name or birthday, into their password. This makes it unsurprising that 4 in 10 Americans have had their personal information compromised online. Passwords built from commonly used phrases or personal information are drastically easier to crack.</p>
<p>You may have noticed over the years that password requirements have increased in complexity, including recommendations to change your passwords every couple of months. Compiled from industry recommendations, below is a list of password requirements you will be asked to test:</p>
<p><strong>Password Requirements:</strong></p>
<ol>
<li>Must be at least 10 characters in length</li>
<li>Must contain at least:<ul>
<li>one lower case letter </li>
<li>one upper case letter </li>
<li>one numeric character </li>
<li>one non-alphanumeric character</li></ul></li>
<li>Must not contain the phrase <code>password</code> (case insensitive)</li>
<li>Must not contain the user's first or last name, e.g., if the user's name is <code>John Smith</code>, then <code>SmItH876!</code> is not a valid password.</li>
</ol>
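
<p>As a rough illustration, the four requirements can be sketched as a single check on one password. The function name <code>is_valid_password</code> is hypothetical, and the sketch assumes that any character outside <code>A-Z</code>, <code>a-z</code>, and <code>0-9</code> (including the underscore) counts as non-alphanumeric:</p>

```python
import re


def is_valid_password(password: str, first_name: str, last_name: str) -> bool:
    """Return True if `password` passes all four requirements (hypothetical helper)."""
    lowered = password.lower()
    checks = [
        len(password) >= 10,                               # rule 1: length
        re.search(r"[a-z]", password) is not None,         # rule 2: lower case letter
        re.search(r"[A-Z]", password) is not None,         # rule 2: upper case letter
        re.search(r"[0-9]", password) is not None,         # rule 2: numeric character
        re.search(r"[^A-Za-z0-9]", password) is not None,  # rule 2: non-alphanumeric (assumption: "_" counts)
        "password" not in lowered,                         # rule 3: banned phrase, case insensitive
        first_name.lower() not in lowered,                 # rule 4: first name
        last_name.lower() not in lowered,                  # rule 4: last name
    ]
    return all(checks)


print(is_valid_password("SmItH876!", "John", "Smith"))                  # False
print(is_valid_password("Mail_Pen%Scarlets.414", "consuelo", "eaton"))  # True
```
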
<p>Here is the dataset that you will investigate in this project:</p>
<div style="background-color: #ebf4f7; color: #595959; text-align:left; vertical-align: middle; padding: 15px 25px 15px 25px; line-height: 1.6;">
    <div style="font-size:20px"><b>datasets/logins.csv</b></div>
Each row represents a login credential. There are no missing values and you can consider the dataset "clean".
<ul>
    <li><b>id:</b> the user's unique ID.</li>
    <li><b>username:</b> the username with the format {firstname}.{lastname}.</li>
    <li><b>password:</b> the password that may or may not meet the requirements. <i>Note, passwords should never be saved in plaintext, always encrypt them when working with real live passwords!</i></li>
</ul>
</div>
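
<p>On the note about never saving plaintext passwords: in practice this usually means storing a salted, deliberately slow hash rather than the password itself. The sketch below uses the standard library's <code>hashlib.pbkdf2_hmac</code>. The helper names and the iteration count are assumptions chosen for illustration, not a recommendation from this project:</p>

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

ITERATIONS = 200_000  # illustrative work factor (an assumption); tune for your hardware


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest) for a password using PBKDF2-HMAC-SHA256."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt for each password
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)


salt, stored = hash_password("Mail_Pen%Scarlets.414")
print(verify_password("Mail_Pen%Scarlets.414", salt, stored))  # True
print(verify_password("wrong-guess", salt, stored))            # False
```
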
<p>Warning: This dataset contains some <strong>real</strong> passwords leaked from <strong>real</strong> websites. These passwords have been filtered, but may still include words that are explicit and offensive.</p>
<p>From here on out, it will be your task to explore and manipulate the existing data until you can answer the two questions described in the instructions panel. Feel free to import as many packages as you need to complete your task, and add cells as necessary. Finally, remember that you are only tested on your answer, not on the methods you use to arrive at the answer!</p>
<p><strong>Note:</strong> To complete this project, you need to know how to manipulate strings in pandas DataFrames and be familiar with regular expressions. Before starting this project we recommend that you have completed the following courses: <a href="https://learn.datacamp.com/courses/data-cleaning-in-python">Data Cleaning in Python</a> and <a href="https://learn.datacamp.com/courses/regular-expressions-in-python">Regular Expressions in Python</a>.</p>

We have a dataset of login credentials and want to check whether each password meets the requirements above. We will analyze the company's employee logins and identify the employees who need to update their passwords.

In [3]:
import pandas as pd
import numpy as np

We need the username column because it encodes each user's first and last name, neither of which may appear in the password.

## Challenge 1: What percentage of users have invalid passwords?

In [4]:
logins = pd.read_csv('datasets/logins.csv')
logins.info() # confirms there are no missing values
logins.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 982 entries, 0 to 981
Data columns (total 3 columns):
id          982 non-null int64
username    982 non-null object
password    982 non-null object
dtypes: int64(1), object(2)
memory usage: 23.1+ KB


Unnamed: 0,id,username,password
0,1,vance.jennings,vanceRules888!
1,2,consuelo.eaton,Mail_Pen%Scarlets.414
2,3,mitchel.perkins,Z00+1960
3,4,odessa.vaughan,D-rockyou
4,5,araceli.wilder,Araceli}r3


In [5]:
# Rule 1: the password must be at least 10 characters long
length_check = logins['password'].str.len() >= 10
length_check.head()

0     True
1     True
2    False
3    False
4     True
Name: password, dtype: bool
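
The length check is an elementwise comparison that yields a boolean Series. A minimal sketch on a toy Series (the variable names are illustrative; the three passwords are taken from the first rows shown above):

```python
import pandas as pd

# Three passwords from the first rows of the dataset
toy = pd.Series(["vanceRules888!", "Z00+1960", "D-rockyou"], name="password")

lengths = toy.str.len()      # number of characters in each password
long_enough = lengths >= 10  # boolean mask, True where rule 1 is satisfied

print(lengths.tolist())      # [14, 8, 9]
print(long_enough.tolist())  # [True, False, False]
print(long_enough.sum())     # 1 password passes the length rule
```
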

In [6]:
valid_pws = logins[length_check] # return values that are true for length check
valid_pws.head()

Unnamed: 0,id,username,password
0,1,vance.jennings,vanceRules888!
1,2,consuelo.eaton,Mail_Pen%Scarlets.414
4,5,araceli.wilder,Araceli}r3
5,6,shawn.harrington,126_239_123
6,7,evelyn.gay,`4:&iAt$'o~(


In [7]:
bad_pws = logins[~length_check] # the tilde (~) negates the mask, keeping rows that fail the length check
bad_pws.head()

Unnamed: 0,id,username,password
2,3,mitchel.perkins,Z00+1960
3,4,odessa.vaughan,D-rockyou
9,10,brant.zimmerman,L?4)OSB$r
16,17,domingo.dyer,VeOw{*p
17,18,martin.pacheco,MP1985???
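
A mask and its negation split the rows into two groups that do not overlap. A small sketch on toy data (the names `passed` and `failed` are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "username": ["mitchel.perkins", "araceli.wilder", "odessa.vaughan"],
    "password": ["Z00+1960", "Araceli}r3", "D-rockyou"],
})

mask = toy["password"].str.len() >= 10
passed = toy[mask]   # rows where the mask is True
failed = toy[~mask]  # ~ flips every value of the mask elementwise

print(passed["username"].tolist())  # ['araceli.wilder']
print(failed["username"].tolist())  # ['mitchel.perkins', 'odessa.vaughan']
```

Every row lands in exactly one of the two frames, so the two lengths always add up to the length of the original.
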


In [8]:
# Rule 2: at least one lower case letter, one upper case letter, one digit, and one non-alphanumeric character

# A character class such as '[abc]' matches a, b, OR c; '[a-z]' matches any lower case letter
lcase = valid_pws['password'].str.contains('[a-z]')
ucase = valid_pws['password'].str.contains('[A-Z]')
numeric = valid_pws['password'].str.contains('[0-9]')
special = valid_pws['password'].str.contains(r'\W') # \W matches any character other than letters, digits, and underscore
pd.concat([valid_pws, lcase, ucase, numeric, special], axis = 1)
# If any of these last 4 columns is False, the password belongs in bad_pws

bad_pws.shape[0], valid_pws.shape[0] # so far: 422 invalid and 560 valid passwords

(422, 560)
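
One detail of the character checks is worth seeing on toy data: `\W` treats the underscore as a word character, so a password whose only symbol is `_` does not count as having a special character. The sketch below compares `\W` with the stricter class `[^A-Za-z0-9]`. The stricter class is an alternative shown for illustration, not what this notebook uses:

```python
import pandas as pd

pw = pd.Series(["Araceli}r3", "126_239_123", "alllowercase", "Mail_Pen%Scarlets.414"])

has_lower = pw.str.contains(r"[a-z]")
has_upper = pw.str.contains(r"[A-Z]")
has_digit = pw.str.contains(r"[0-9]")
special_w = pw.str.contains(r"\W")                 # "_" is NOT matched by \W
special_strict = pw.str.contains(r"[^A-Za-z0-9]")  # "_" IS matched here

print(special_w.tolist())       # [True, False, False, True]
print(special_strict.tolist())  # [True, True, False, True]
```

The two definitions disagree only on `126_239_123`, which fails the lower case and upper case checks under either one.
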

In [9]:
char_check = lcase & ucase & numeric & special # True only when all 4 checks pass
# Add the passwords that passed the length check but fail the character check to the bad passwords.
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
bad_pws = pd.concat([bad_pws, valid_pws[~char_check]])

valid_pws = valid_pws[char_check] # Order matters: update bad_pws before filtering valid_pws

bad_pws.shape[0], valid_pws.shape[0] # the invalid count grew to 724




(724, 258)
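
The move-failures-then-filter pattern can be sketched on a toy frame. It uses `pd.concat`, which works across pandas versions (`DataFrame.append` was removed in pandas 2.0). The names `failed` and `candidates` are illustrative:

```python
import pandas as pd

toy = pd.DataFrame({"password": ["Short1!", "longbutlowercase", "Araceli}r3!!"]})

length_ok = toy["password"].str.len() >= 10
failed = toy[~length_ok]     # fails rule 1
candidates = toy[length_ok]  # still in the running

has_upper = candidates["password"].str.contains(r"[A-Z]")
# Move the new failures first, then shrink the candidate set
failed = pd.concat([failed, candidates[~has_upper]])
candidates = candidates[has_upper]

print(failed["password"].tolist())      # ['Short1!', 'longbutlowercase']
print(candidates["password"].tolist())  # ['Araceli}r3!!']
```

Because `has_upper` is aligned to `candidates` by index, it has to be applied before `candidates` is reassigned, which is why the order of the two statements matters.
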

In [10]:
# Rule 3: Must not contain the phrase password (case insensitive)
banned_phrases = valid_pws['password'].str.contains('password', case = False)

# No tilde here: rows where banned_phrases is True are the ones that move to bad_pws
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
bad_pws = pd.concat([bad_pws, valid_pws[banned_phrases]])

valid_pws = valid_pws[~banned_phrases]
bad_pws.shape[0], valid_pws.shape[0] # one more password moved to the invalid set

(725, 257)
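
A short sketch of the case-insensitive phrase check on toy data. Passing `regex=False` is an optional choice made here so the phrase is treated as a literal substring. Note that a substitution such as `Pa$$w0rd` is not caught by this rule:

```python
import pandas as pd

pw = pd.Series(["SeansPa$$w0rd", "MyPassWord#2024", "Parole:Seagull+Cession-148"])

# case=False ignores capitalization; regex=False matches the literal text
banned = pw.str.contains("password", case=False, regex=False)

print(banned.tolist())  # [False, True, False]
```
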

In [11]:
# Rule 4: Must not contain the user's first or last name
valid_pws = valid_pws.copy() # work on a copy to avoid SettingWithCopyWarning when adding columns
valid_pws['first_name'] = valid_pws['username'].str.extract('(^[a-z]+)', expand = False)
valid_pws['last_name'] = valid_pws['username'].str.extract('([a-z]+$)', expand = False)
# ^ anchors the match at the start of the string; $ anchors it at the end, so [a-z]+$ captures the trailing run of letters
valid_pws


Unnamed: 0,id,username,password,first_name,last_name
0,1,vance.jennings,vanceRules888!,vance,jennings
1,2,consuelo.eaton,Mail_Pen%Scarlets.414,consuelo,eaton
4,5,araceli.wilder,Araceli}r3,araceli,wilder
6,7,evelyn.gay,`4:&iAt$'o~(,evelyn,gay
8,9,gladys.ward,=Wj1`i)xYYZ,gladys,ward
11,12,milford.hubbard,Milford<3Tom,milford,hubbard
13,14,jamie.cochran,Deviants.Assists.Impede+24,jamie,cochran
15,16,lorrie.gay,Q0G:[@u9*_`_,lorrie,gay
21,22,leticia.sanford,Parole:Seagull+Cession-148,leticia,sanford
23,24,brandie.webster,321.Snuffs-Pinball.Nougat,brandie,webster
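
The name extraction can be checked on a couple of usernames. The sketch also shows `str.split` as an alternative, assuming every username has the `{firstname}.{lastname}` form described in the dataset notes:

```python
import pandas as pd

users = pd.Series(["vance.jennings", "consuelo.eaton"])

# Regex approach: letters at the start, letters at the end
first = users.str.extract(r"^([a-z]+)", expand=False)
last = users.str.extract(r"([a-z]+)$", expand=False)

# Alternative: split once on the dot into two columns (0 and 1)
parts = users.str.split(".", n=1, expand=True)

print(first.tolist())     # ['vance', 'consuelo']
print(last.tolist())      # ['jennings', 'eaton']
print(parts[1].tolist())  # ['jennings', 'eaton']
```
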


In [17]:
# Now check whether the first or last name appears in the password.
# The check is case insensitive, so each password is lowercased first (the extracted names are already lower case).
name_in_pw = valid_pws.apply(
    lambda row: row['first_name'] in row['password'].lower() or row['last_name'] in row['password'].lower(),
    axis = 1,
)
# DataFrame.append was removed in pandas 2.0; pd.concat with ignore_index = True renumbers the rows from 0
bad_pws = pd.concat([bad_pws, valid_pws[name_in_pw]], ignore_index = True)
valid_pws = valid_pws[~name_in_pw]

valid_pws[:20] # we have filtered them out
        

Unnamed: 0,id,username,password,first_name,last_name
1,2,consuelo.eaton,Mail_Pen%Scarlets.414,consuelo,eaton
6,7,evelyn.gay,`4:&iAt$'o~(,evelyn,gay
8,9,gladys.ward,=Wj1`i)xYYZ,gladys,ward
13,14,jamie.cochran,Deviants.Assists.Impede+24,jamie,cochran
15,16,lorrie.gay,Q0G:[@u9*_`_,lorrie,gay
21,22,leticia.sanford,Parole:Seagull+Cession-148,leticia,sanford
23,24,brandie.webster,321.Snuffs-Pinball.Nougat,brandie,webster
29,30,rene.small,"]9""mP(kM4c",rene,small
30,31,rosanna.reid,Outguess%Dresser:Derails=669,rosanna,reid
33,34,patrica.hicks,Wanderer.849+Enlarges:Olympia,patrica,hicks
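
The name check compares values from three columns row by row. One way to build the mask without modifying a frame while looping over it is a list comprehension over the zipped columns, sketched here on toy rows (the variable names are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "password": ["Araceli}r3", "HaleyComet333$", "Mail_Pen%Scarlets.414"],
    "first_name": ["araceli", "raymundo", "consuelo"],
    "last_name": ["wilder", "haley", "eaton"],
})

lowered = toy["password"].str.lower()  # compare in lower case
contains_name = pd.Series(
    [first in pw or last in pw
     for pw, first, last in zip(lowered, toy["first_name"], toy["last_name"])],
    index=toy.index,  # keep the mask aligned with the frame
)

print(contains_name.tolist())                 # [True, True, False]
print(toy[~contains_name]["password"].tolist())  # ['Mail_Pen%Scarlets.414']
```
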


In [20]:
bad_pws.tail(10)

Unnamed: 0,id,username,password,first_name,last_name
726,5,araceli.wilder,Araceli}r3,araceli,wilder
727,12,milford.hubbard,Milford<3Tom,milford,hubbard
728,141,ronald.brooks,P1G_bT”_zBrooks,ronald,brooks
729,150,raymundo.haley,HaleyComet333$,raymundo,haley
730,668,simon.miranda,SimonR0ck$,simon,miranda
731,750,irvin.martinez,bananaIrvin8),irvin,martinez
732,790,sean.leon,SeansPa$$w0rd,sean,leon
733,829,ted.horne,dakota&ted4Ever,ted,horne
734,926,houston.garcia,earth2Houston!,houston,garcia
735,965,chrystal.burns,ChRYSTAL90?,chrystal,burns


In [21]:
# Share of users with invalid passwords, as a fraction of all logins (0.75 means 75%)
bad_pass = round(bad_pws.shape[0]/logins.shape[0],2) # rounded to 2 decimal places
bad_pass

0.75
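
The arithmetic behind this figure can be checked by hand. The last index shown in `bad_pws.tail(10)` is 735, which gives 736 invalid rows out of the 982 logins reported by `info()`. The second half of the sketch shows that the mean of a boolean mask gives the same kind of share directly (toy data):

```python
import pandas as pd

n_invalid, n_total = 736, 982
print(round(n_invalid / n_total, 2))  # 0.75

# The mean of a boolean Series is the fraction of True values
invalid = pd.Series([True, True, True, False])
share = invalid.mean()
print(f"{share:.0%}")                 # 75%
```
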

## Challenge 2: Which users need to change their passwords?

In [22]:
email_list = bad_pws['username'].sort_values()
email_list[:20]

405           abdul.rowland
309            addie.cherry
372            adele.moreno
517            adeline.bush
279             adolfo.kane
337             adolfo.lara
16             ahmad.hopper
122              aida.combs
700           aisha.jenkins
199               al.dunlap
147            alana.franco
593         alberta.leblanc
521            alec.robbins
671    alejandra.stephenson
434         alejandro.burke
482        alejandro.nieves
205        alexander.thomas
400       alexandria.hinton
453       alexis.mccullough
93          alexis.reynolds
Name: username, dtype: object
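
The index on the left of this output is the row position from `bad_pws`, not a rank. If a clean 0-based list is wanted for a mailing, resetting the index after sorting is one option, sketched here on three usernames with made-up index values:

```python
import pandas as pd

bad = pd.DataFrame(
    {"username": ["ted.horne", "abdul.rowland", "sean.leon"]},
    index=[733, 405, 732],  # arbitrary row labels for illustration
)

to_notify = bad["username"].sort_values().reset_index(drop=True)

print(to_notify.tolist())        # ['abdul.rowland', 'sean.leon', 'ted.horne']
print(to_notify.index.tolist())  # [0, 1, 2]
```
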