# Week 4: An intro to Python, Pandas and Jupyter 

## Environment basics
- Class [GitHub repo](https://github.com/stiles/usc)
- Structure and workflow
- Clone the repo using GitHub Desktop | [documentation](https://docs.github.com/en/desktop/contributing-and-collaborating-using-github-desktop/adding-and-cloning-repositories/cloning-a-repository-from-github-to-github-desktop)
- Set a directory structure locally and stick to it!

### Jupyter Lab interface basics
- Directory pane
- [Documentation](https://jupyterlab.readthedocs.io/en/stable/)
- Typing and executing code in cells
- Cell types
- Running/restarting a notebook
- Markdown/documentation
- Keyboard shortcuts
- Copy/paste cells

### Import Python tools
Python libraries add tools that the core language doesn't include. For this lesson we only need to import [Pandas](https://pandas.pydata.org/docs/getting_started/index.html#getting-started), which is conventionally given the alias `pd`.

In [16]:
import pandas as pd

### Now what? 

#### Let's start with simple math

In [17]:
10 + 6

16

In [18]:
100 / 5

20.0

In [19]:
6 * 10

60
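Notice that `100 / 5` displayed `20.0` rather than `20`. In Python 3, the `/` operator always returns a decimal number (a float). A minimal sketch comparing it with two related operators (the variable names here are just for illustration):

```python
# True division always returns a float, even when the result is a whole number
true_quotient = 100 / 5

# Floor division returns an integer when both numbers are integers
floor_quotient = 100 // 5

# The modulo operator returns the remainder after division
remainder = 100 % 7

print(true_quotient, floor_quotient, remainder)
```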

#### Defining a variable

In [20]:
number = 100

In [21]:
print(number)

100


In [22]:
number + 3

103

In [23]:
print(number, number, number)

100 100 100


#### Conduct a simple data analysis

In [24]:
my_list = [2, 4, 6, 8, 10, 12, 14, 16]

In [25]:
my_list

[2, 4, 6, 8, 10, 12, 14, 16]

In [26]:
my_series = pd.Series(my_list)

In [27]:
# my_series = pd.Series(np.random.randint(10, 1000,size=100000000))

In [28]:
my_series

0     2
1     4
2     6
3     8
4    10
5    12
6    14
7    16
dtype: int64
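The commented-out cell above hints at building a much larger Series from random integers. That line needs NumPy, which this notebook never imports. Here is a scaled-down sketch, assuming NumPy is installed; the size of 1,000, the seed and the names `rng` and `big_series` are choices made for this example, not part of the lesson.

```python
import numpy as np
import pandas as pd

# A seeded generator makes the random numbers repeatable
rng = np.random.default_rng(seed=42)

# 1,000 random integers from 10 up to (but not including) 1000
random_values = rng.integers(10, 1000, size=1_000)

big_series = pd.Series(random_values)

print(len(big_series))
print(big_series.head())
```

The same descriptive statistics used below work on a Series of any length.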

#### Descriptive statistics

Once the data becomes a Series, you can immediately run a wide range of [descriptive statistics](https://en.wikipedia.org/wiki/Descriptive_statistics). Let’s try a few.

In [29]:
my_series.sum()

72

#### Then find the maximum value

In [30]:
my_series.max()

16

#### And the minimum value

In [31]:
my_series.min()

2

#### How about the average (also known as the mean)? Keep adding cells and calculating new statistics.

In [32]:
my_series.mean()

9.0

#### The median?

In [33]:
my_series.median()

9.0

#### The standard deviation?

In [34]:
my_series.std()

4.898979485566356
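One detail worth knowing: by default, `std()` in Pandas computes the *sample* standard deviation, dividing by `n - 1` rather than `n`. The sketch below rebuilds the same Series and checks that by hand; the names `sample_std`, `population_std` and `manual_std` are just for illustration.

```python
import pandas as pd

my_series = pd.Series([2, 4, 6, 8, 10, 12, 14, 16])

# Pandas default: sample standard deviation (ddof=1)
sample_std = my_series.std()

# Population standard deviation: divide by n instead of n - 1
population_std = my_series.std(ddof=0)

# Recompute the sample version step by step
deviations = my_series - my_series.mean()
manual_std = (deviations.pow(2).sum() / (len(my_series) - 1)) ** 0.5

print(sample_std, population_std, manual_std)
```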

#### And all of the above, plus a little more about the distribution, in one simple command

In [35]:
my_series.describe().round()

count     8.0
mean      9.0
std       5.0
min       2.0
25%       6.0
50%       9.0
75%      12.0
max      16.0
dtype: float64

---

## Import data

#### Read a CSV file with members of Congress

In [98]:
df_csv = pd.read_csv('../../data/raw/members_of_congress_117.csv')

In [99]:
df_csv.head()

| | id | seniority | full_name | apstate | first_name | last_name | chamber | session | state | party | dw_nominate | votes_with_party_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A000370 | 10 | Alma Adams | N.C. | Alma | Adams | house | 117 | NC | D | -0.465 | 98.65 |
| 1 | A000055 | 26 | Robert Aderholt | Ala. | Robert | Aderholt | house | 117 | AL | R | 0.376 | 97.02 |
| 2 | A000371 | 8 | Pete Aguilar | Calif. | Pete | Aguilar | house | 117 | CA | D | -0.294 | 98.43 |
| 3 | A000372 | 8 | Rick Allen | Ga. | Rick | Allen | house | 117 | GA | R | 0.696 | 93.12 |
| 4 | A000376 | 4 | Colin Allred | Texas | Colin | Allred | house | 117 | TX | D | NaN | 97.54 |


#### Or an Excel file

In [100]:
df_excel = pd.read_excel('../../data/raw/members_of_congress_117_excel.xlsx')

#### Import from a URL

In [101]:
df_url = pd.read_csv('https://raw.githubusercontent.com/stiles/notebooks/master/congress/output/members_of_congress_117.csv')
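`pd.read_csv()` accepts a local path, a URL or any file-like object. The sketch below reads a few rows (taken from the table above, with fewer columns) out of an in-memory string, so it runs without the class data files. The names `csv_text` and `sample_df` are just for illustration.

```python
import io

import pandas as pd

# A tiny CSV held in memory; note the empty dw_nominate value on the last row
csv_text = """id,full_name,party,dw_nominate,votes_with_party_pct
A000370,Alma Adams,D,-0.465,98.65
A000055,Robert Aderholt,R,0.376,97.02
A000376,Colin Allred,D,,97.54
"""

# io.StringIO makes the string behave like an open file
sample_df = pd.read_csv(io.StringIO(csv_text))

# The empty field is read as a missing value (NaN)
print(sample_df)
```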

#### Make a copy and assign a new variable

In [102]:
df = df_csv.copy()
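Why use `.copy()` instead of a plain `df = df_csv`? A plain assignment only creates a second name for the same dataframe, while `.copy()` creates an independent one. A minimal sketch with a made-up two-row dataframe (all names here are hypothetical):

```python
import pandas as pd

original = pd.DataFrame({"name": ["A", "B"], "seniority": [10, 26]})

# Plain assignment: both names point at the same dataframe
alias = original

# .copy() creates a separate dataframe with the same contents
independent = original.copy()

# Changing the copy leaves the original untouched
independent.loc[0, "seniority"] = 99

print(original.loc[0, "seniority"])
print(alias is original)
```

Working on a copy means the freshly imported data stays available if an experiment goes wrong.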

---

## Understanding the dataframe

#### Use the `describe()` method to get summary stats on the numerical columns

In [103]:
df.describe().round()

| | seniority | session | dw_nominate | votes_with_party_pct |
|---|---|---|---|---|
| count | 548.0 | 548.0 | 385.0 | 542.0 |
| mean | 11.0 | 117.0 | 0.0 | 95.0 |
| std | 9.0 | 0.0 | 0.0 | 5.0 |
| min | 1.0 | 117.0 | -1.0 | 55.0 |
| 25% | 4.0 | 117.0 | -0.0 | 94.0 |
| 50% | 8.0 | 117.0 | -0.0 | 98.0 |
| 75% | 15.0 | 117.0 | 0.0 | 99.0 |
| max | 50.0 | 117.0 | 1.0 | 100.0 |


#### Use the `info()` method for data types and columns

In [104]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 548 entries, 0 to 547
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    548 non-null    object 
 1   seniority             548 non-null    int64  
 2   full_name             548 non-null    object 
 3   apstate               536 non-null    object 
 4   first_name            548 non-null    object 
 5   last_name             548 non-null    object 
 6   chamber               548 non-null    object 
 7   session               548 non-null    int64  
 8   state                 548 non-null    object 
 9   party                 548 non-null    object 
 10  dw_nominate           385 non-null    float64
 11  votes_with_party_pct  542 non-null    float64
dtypes: float64(2), int64(2), object(8)
memory usage: 51.5+ KB
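The "Non-Null Count" column above shows that some columns have gaps: `dw_nominate` has 385 values for 548 rows. To count the missing values directly, one option is `isna().sum()`. A small sketch on a made-up three-row dataframe (`sample_df` and `missing_per_column` are illustrative names):

```python
import pandas as pd

sample_df = pd.DataFrame({
    "full_name": ["Alma Adams", "Colin Allred", "Kelly Armstrong"],
    "seniority": [10, 4, 4],
    "dw_nominate": [-0.465, None, None],
})

# isna() marks each missing cell as True; sum() counts them per column
missing_per_column = sample_df.isna().sum()

print(sample_df.dtypes)
print(missing_per_column)
```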


#### Reading one column, or "Series"

In [105]:
df['full_name']

0              Alma Adams
1         Robert Aderholt
2            Pete Aguilar
3              Rick Allen
4            Colin Allred
              ...        
543      Elizabeth Warren
544    Sheldon Whitehouse
545          Roger Wicker
546             Ron Wyden
547            Todd Young
Name: full_name, Length: 548, dtype: object
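Single brackets with one column name return a Series. Double brackets with a list of names return a smaller dataframe. A minimal sketch on a made-up dataframe (the names are for illustration only):

```python
import pandas as pd

sample_df = pd.DataFrame({
    "full_name": ["Alma Adams", "Robert Aderholt"],
    "party": ["D", "R"],
    "seniority": [10, 26],
})

# One column name in single brackets: a Series
one_column = sample_df["full_name"]

# A list of column names in double brackets: a DataFrame
two_columns = sample_df[["full_name", "party"]]

print(type(one_column).__name__, type(two_columns).__name__)
```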

#### Counting values in categorical or string columns

In [106]:
df.party.value_counts()

D     278
R     268
ID      2
Name: party, dtype: int64
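Raw counts are often easier to read as shares. Passing `normalize=True` to `value_counts()` returns proportions instead. The sketch below rebuilds a party column with the same counts shown above (278, 268 and 2) rather than loading the class file; `party` and `shares` are illustrative names.

```python
import pandas as pd

# Rebuild a column with the same party counts as the output above
party = pd.Series(["D"] * 278 + ["R"] * 268 + ["ID"] * 2)

# normalize=True returns each value's share of the total
shares = party.value_counts(normalize=True).round(3)

print(shares)
```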

---

## Interacting with the data

#### Use the `head()` method to see the first *n* rows

In [107]:
df.head(10)

| | id | seniority | full_name | apstate | first_name | last_name | chamber | session | state | party | dw_nominate | votes_with_party_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A000370 | 10 | Alma Adams | N.C. | Alma | Adams | house | 117 | NC | D | -0.465 | 98.65 |
| 1 | A000055 | 26 | Robert Aderholt | Ala. | Robert | Aderholt | house | 117 | AL | R | 0.376 | 97.02 |
| 2 | A000371 | 8 | Pete Aguilar | Calif. | Pete | Aguilar | house | 117 | CA | D | -0.294 | 98.43 |
| 3 | A000372 | 8 | Rick Allen | Ga. | Rick | Allen | house | 117 | GA | R | 0.696 | 93.12 |
| 4 | A000376 | 4 | Colin Allred | Texas | Colin | Allred | house | 117 | TX | D | NaN | 97.54 |
| 5 | A000369 | 12 | Mark Amodei | Nev. | Mark | Amodei | house | 117 | NV | R | 0.38 | 95.43 |
| 6 | A000377 | 4 | Kelly Armstrong | N.D. | Kelly | Armstrong | house | 117 | ND | R | NaN | 95.19 |
| 7 | A000375 | 6 | Jodey Arrington | Texas | Jodey | Arrington | house | 117 | TX | R | 0.648 | 89.31 |
| 8 | A000148 | 2 | Jake Auchincloss | Mass. | Jake | Auchincloss | house | 117 | MA | D | NaN | 98.65 |
| 9 | A000378 | 4 | Cynthia Axne | Iowa | Cynthia | Axne | house | 117 | IA | D | NaN | 96.87 |


#### Use the `tail()` method to see the last *n* rows

In [108]:
df.tail(10)

| | id | seniority | full_name | apstate | first_name | last_name | chamber | session | state | party | dw_nominate | votes_with_party_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 538 | T000461 | 11 | Patrick Toomey | Pa. | Patrick | Toomey | senate | 117 | PA | R | 0.63 | 90.91 |
| 539 | T000278 | 1 | Tommy Tuberville | Ala. | Tommy | Tuberville | senate | 117 | AL | R | NaN | 88.46 |
| 540 | V000128 | 5 | Chris Van Hollen | Md. | Chris | Van Hollen | senate | 117 | MD | D | -0.391 | 99.62 |
| 541 | W000805 | 13 | Mark Warner | Va. | Mark | Warner | senate | 117 | VA | D | -0.203 | 99.62 |
| 542 | W000790 | 1 | Raphael Warnock | Ga. | Raphael | Warnock | senate | 117 | GA | D | NaN | 99.03 |
| 543 | W000817 | 9 | Elizabeth Warren | Mass. | Elizabeth | Warren | senate | 117 | MA | D | -0.759 | 97.91 |
| 544 | W000802 | 15 | Sheldon Whitehouse | R.I. | Sheldon | Whitehouse | senate | 117 | RI | D | -0.355 | 99.05 |
| 545 | W000437 | 15 | Roger Wicker | Miss. | Roger | Wicker | senate | 117 | MS | R | 0.378 | 91.2 |
| 546 | W000779 | 26 | Ron Wyden | Ore. | Ron | Wyden | senate | 117 | OR | D | -0.33 | 98.87 |
| 547 | Y000064 | 5 | Todd Young | Ind. | Todd | Young | senate | 117 | IN | R | 0.465 | 90.33 |


#### Sorting with the `sort_values()` method to find the members with the most seniority

In [96]:
df.sort_values('seniority', ascending=False).head(10)

| | id | seniority | full_name | apstate | first_name | last_name | chamber | session | state | party | dw_nominate | votes_with_party_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 444 | Y000033 | 50 | Don Young | Alaska | Don | Young | house | 117 | AK | R | 0.283 | 88.45 |
| 497 | L000174 | 47 | Patrick Leahy | Vt. | Patrick | Leahy | senate | 117 | VT | D | -0.36 | 99.62 |
| 188 | H000874 | 42 | Steny Hoyer | Md. | Steny | Hoyer | house | 117 | MD | D | -0.38 | 98.41 |
| 340 | R000395 | 42 | Harold Rogers | Ky. | Harold | Rogers | house | 117 | KY | R | 0.338 | 97.03 |
| 378 | S000522 | 42 | Christopher Smith | N.J. | Christopher | Smith | house | 117 | NJ | R | 0.167 | 88.51 |
| 479 | G000386 | 41 | Charles Grassley | Iowa | Charles | Grassley | senate | 117 | IA | R | 0.35 | 82.83 |
| 209 | K000009 | 40 | Marcy Kaptur | Ohio | Marcy | Kaptur | house | 117 | OH | D | -0.35 | 98.64 |
| 505 | M000355 | 37 | Mitch McConnell | Ky. | Mitch | McConnell | senate | 117 | KY | R | 0.405 | 89.94 |
| 97 | D000191 | 36 | Peter DeFazio | Ore. | Peter | DeFazio | house | 117 | OR | D | -0.428 | 98.4 |
| 311 | P000034 | 36 | Frank Pallone | N.J. | Frank | Pallone | house | 117 | NJ | D | -0.404 | 98.66 |
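An alternative to sorting and then calling `head()` is the `nlargest()` method, which does both steps at once. A small sketch using a few names and seniority values from this dataset (`sample_df` and `top_two` are illustrative names):

```python
import pandas as pd

sample_df = pd.DataFrame({
    "full_name": ["Alma Adams", "Don Young", "Steny Hoyer", "Patrick Leahy"],
    "seniority": [10, 50, 42, 47],
})

# Return the two rows with the largest seniority values, largest first
top_two = sample_df.nlargest(2, "seniority")

print(top_two)
```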


#### Sorting with the `sort_values()` method to find the most liberal members (use `ascending=False` for the most conservative)

In [97]:
df.sort_values('dw_nominate', ascending=True).head(10)

| | id | seniority | full_name | apstate | first_name | last_name | chamber | session | state | party | dw_nominate | votes_with_party_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 543 | W000817 | 9 | Elizabeth Warren | Mass. | Elizabeth | Warren | senate | 117 | MA | D | -0.759 | 97.91 |
| 481 | H001075 | 3 | Kamala Harris | Calif. | Kamala | Harris | senate | 117 | CA | D | -0.709 | 100.0 |
| 197 | J000298 | 6 | Pramila Jayapal | Wash. | Pramila | Jayapal | house | 117 | WA | D | -0.691 | 97.52 |
| 238 | L000551 | 26 | Barbara Lee | Calif. | Barbara | Lee | house | 117 | CA | D | -0.681 | 97.96 |
| 427 | W000187 | 32 | Maxine Waters | Calif. | Maxine | Waters | house | 117 | CA | D | -0.655 | 95.12 |
| 428 | W000822 | 8 | Bonnie Watson Coleman | N.J. | Bonnie | Watson Coleman | house | 117 | NJ | D | -0.621 | 98.21 |
| 117 | E000297 | 6 | Adriano Espaillat | N.Y. | Adriano | Espaillat | house | 117 | NY | D | -0.615 | 97.97 |
| 71 | C001067 | 16 | Yvette Clarke | N.Y. | Yvette | Clarke | house | 117 | NY | D | -0.613 | 98.18 |
| 359 | S001145 | 24 | Jan Schakowsky | Ill. | Jan | Schakowsky | house | 117 | IL | D | -0.606 | 98.64 |
| 164 | G000551 | 20 | Raúl Grijalva | Ariz. | Raúl | Grijalva | house | 117 | AZ | D | -0.598 | 98.41 |
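Two things to keep in mind when sorting on `dw_nominate`: lower (more negative) scores sit at the liberal end of the scale, and members with no score are placed last whichever direction you sort. A minimal sketch using four rows from this dataset (`sample_df`, `most_liberal` and `most_conservative` are illustrative names):

```python
import pandas as pd

sample_df = pd.DataFrame({
    "full_name": ["Elizabeth Warren", "Rick Allen", "Colin Allred", "Alma Adams"],
    "dw_nominate": [-0.759, 0.696, None, -0.465],
})

# Ascending: most negative (most liberal) score first
most_liberal = sample_df.sort_values("dw_nominate", ascending=True)

# Descending: most positive (most conservative) score first
most_conservative = sample_df.sort_values("dw_nominate", ascending=False)

# Rows with a missing score go to the end in both cases (the default)
print(most_liberal)
print(most_conservative)
```

If you want the missing rows first instead, `sort_values()` takes `na_position='first'`.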


---

## What questions would you ask of this dataset? 

In [123]:
df['seniority'].max()

50

In [124]:
df['seniority'].min()

1

In [125]:
round(df['seniority'].mean(), 2)

10.88

In [126]:
round(df['votes_with_party_pct'].mean(), 2)

95.34

In [127]:
df.groupby(['chamber', 'party'])['dw_nominate'].mean()

chamber  party
house    D       -0.390068
         R        0.492037
senate   D       -0.349682
         ID      -0.345500
         R        0.501841
Name: dw_nominate, dtype: float64

In [128]:
df.groupby(['state'])['dw_nominate'].mean().sort_values().head()

state
MA   -0.459125
VT   -0.433667
HI   -0.384000
RI   -0.371750
NJ   -0.367800
Name: dw_nominate, dtype: float64

In [129]:
df.groupby(['party'])['seniority'].mean().sort_values().head()

party
R      9.298507
ID    12.000000
D     12.388489
Name: seniority, dtype: float64
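The `groupby()` pattern above calculates one statistic per group. To get several at once, one option is the `agg()` method with a list of statistic names. A sketch using the party and seniority values from the first five rows of this dataset (`sample_df` and `summary` are illustrative names):

```python
import pandas as pd

sample_df = pd.DataFrame({
    "party": ["D", "R", "D", "R", "D"],
    "seniority": [10, 26, 8, 8, 4],
})

# Group by party, then compute three statistics on seniority in one pass
summary = sample_df.groupby("party")["seniority"].agg(["count", "mean", "max"])

print(summary)
```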

---

## Export

In [130]:
df.to_csv('../../data/processed/members_of_congress_117.csv', index=False)
df.to_excel('../../data/processed/members_of_congress_117.xlsx', index=False)
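Why `index=False`? Without it, Pandas also writes the row index to the file, and it comes back as an extra `Unnamed: 0` column when the file is read again. The sketch below shows both cases using a temporary folder, so nothing is written to the class repo. The file names and variable names are made up for this example.

```python
import tempfile
from pathlib import Path

import pandas as pd

sample_df = pd.DataFrame({
    "full_name": ["Alma Adams", "Robert Aderholt"],
    "seniority": [10, 26],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    clean_path = Path(tmp_dir) / "sample.csv"
    with_index_path = Path(tmp_dir) / "sample_with_index.csv"

    # index=False leaves the row numbers out of the file
    sample_df.to_csv(clean_path, index=False)

    # The default also writes the row numbers as an unnamed first column
    sample_df.to_csv(with_index_path)

    clean = pd.read_csv(clean_path)
    with_index = pd.read_csv(with_index_path)

print(list(clean.columns))
print(list(with_index.columns))
```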