## Experiential Task 1 ##
Patents provide an important mechanism for individuals and organizations to protect intellectual property. They encourage innovation and have been extensively linked to firm value. Given their importance to organizations, researchers in economics, finance, and accounting have explored various factors that contribute to patent awards (e.g., R&D expense, industry membership, competition, etc.) and outcomes attributable to patents (e.g., risk profile, executive turnover, etc.). 

One very popular study in the area, [Kogan, Papanikolaou, Seru, and Stoffman (*QJE* 2017; hereafter KPSS)](https://academic.oup.com/qje/article-abstract/132/2/665/3076284), quantifies the value created by individual patents and shows that this measure positively predicts future firm growth. 

In this task, you will use the data provided by KPSS (**KPSS_2022.csv**) and explore how certain features of patents correlate with patent value. Specific requirements include the following:
1. Load the KPSS data and restrict it to patent grants approved in January of 2019.
2. Merge in the company names using the "permno" field and the "crsp_names.csv" dataset. Comment on which organization has the highest average value per patent.
3. Obtain the original patent grants for these patents from the USPTO using the Bulkdata API.
4. Generate a document-term matrix from the patent "abstract" (a one-paragraph summary of the patent) and report the 25 most commonly used words and phrases. Use the following criteria for preprocessing and tokenizing your data: 
    - Include only those tokens that are all letters (alpha) and use lowercase for everything
    - Allow for single words and bigrams
    - Require tokens be at least 3 characters long
    - Exclude stop words
    - Restrict the matrix to the 1,000 most common words
5. Identify and report the 10 terms that correlate most positively and 10 terms that correlate most negatively with patent value ('xi_real' from the KPSS data)

### STEP 1: Load KPSS data & restrict to January 2019
Use pandas to load the KPSS data into a data frame called `df`:

In [96]:
import pandas as pd
df = pd.read_csv('KPSS_2022.csv')
df

Unnamed: 0,patent_num,permno,issue_date,filing_date,xi_nominal,xi_real,cites
0,1570604,15368,19260126,19230604.0,0.156578,0.863388,5
1,1570677,10786,19260126,19220726.0,0.029288,0.161497,0
2,1570692,10807,19260126,19241117.0,0.508039,2.801388,0
3,1570694,14613,19260126,19231217.0,0.193691,1.068035,0
4,1570923,10401,19260126,19221229.0,0.206683,1.139674,0
...,...,...,...,...,...,...,...
3160448,11540212,76076,20221227,20210108.0,46.735361,15.368657,0
3160449,11540211,77178,20221227,20200922.0,11.902130,3.913948,0
3160450,11540209,85425,20221227,20170612.0,1.789387,0.588430,0
3160451,11540239,77178,20221227,20210119.0,11.902130,3.913948,0


The column `issue_date` is not in a datetime format, so we'll use `pd.to_datetime` to convert it. Note that because the date is stored as an 8-digit number, we need to provide the format. I've provided you that code:

In [98]:
df['issue_date'] = pd.to_datetime(df['issue_date'],format="%Y%m%d")
df

Unnamed: 0,patent_num,permno,issue_date,filing_date,xi_nominal,xi_real,cites
0,1570604,15368,1926-01-26,19230604.0,0.156578,0.863388,5
1,1570677,10786,1926-01-26,19220726.0,0.029288,0.161497,0
2,1570692,10807,1926-01-26,19241117.0,0.508039,2.801388,0
3,1570694,14613,1926-01-26,19231217.0,0.193691,1.068035,0
4,1570923,10401,1926-01-26,19221229.0,0.206683,1.139674,0
...,...,...,...,...,...,...,...
3160448,11540212,76076,2022-12-27,20210108.0,46.735361,15.368657,0
3160449,11540211,77178,2022-12-27,20200922.0,11.902130,3.913948,0
3160450,11540209,85425,2022-12-27,20170612.0,1.789387,0.588430,0
3160451,11540239,77178,2022-12-27,20210119.0,11.902130,3.913948,0


Now restrict the original dataframe, `df`, to only those patents with issue dates in January of 2019. Call this new dataframe `sub`.

In [100]:
sub = df[(df['issue_date'] >= '2019-01-01') & (df['issue_date']<= '2019-01-31')]
sub

Unnamed: 0,patent_num,permno,issue_date,filing_date,xi_nominal,xi_real,cites
2787418,10172273,77520,2019-01-08,20170202.0,26.409036,10.030283,0
2787419,10172275,14144,2019-01-08,20161229.0,9.621185,3.654174,2
2787420,10172291,14144,2019-01-08,20170116.0,9.621185,3.654174,0
2787421,10172277,14144,2019-01-08,20161006.0,9.621185,3.654174,6
2787422,10172280,14144,2019-01-08,20140429.0,9.621185,3.654174,0
...,...,...,...,...,...,...,...
2793532,10194506,84788,2019-01-29,20180405.0,168.984771,64.181254,0
2793533,10194511,92469,2019-01-29,20170731.0,2.362142,0.897153,2
2793534,10194514,84381,2019-01-29,20151111.0,28.048238,10.652860,0
2793535,10194482,59328,2019-01-29,20171212.0,22.343084,8.486014,0
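
An equivalent filter can be written with the `.dt` accessor, which avoids hard-coding the last day of the month. The following is a minimal sketch on a toy frame (the `toy` data is made up for illustration), not a replacement for the cell above:

```python
import pandas as pd

# Toy data standing in for the KPSS frame (values are illustrative only)
toy = pd.DataFrame({
    "patent_num": [1, 2, 3, 4],
    "issue_date": pd.to_datetime(["20181225", "20190108", "20190129", "20190205"],
                                 format="%Y%m%d"),
})

# Keep rows whose issue date falls in January 2019
is_jan_2019 = (toy["issue_date"].dt.year == 2019) & (toy["issue_date"].dt.month == 1)
toy_sub = toy[is_jan_2019]
print(toy_sub["patent_num"].tolist())
```

Both approaches should select the same rows as long as `issue_date` has already been converted to datetime.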


### STEP 2: Merge in Company Names and report high values
The data from KPSS has a field called `permno` which allows us to link to a dataset called CRSP ("Center for Research in Security Prices"). I've provided the "names" file for you, so you can merge in organization names to the patent data.

Here's the code you'll need to load the dataset, assuming "crsp_names.csv" is in the same folder as this demo. Use the variable name `names` for this dataset:

In [102]:
## Load Data:
names = pd.read_csv('crsp_names.csv') # Add code here
names # examine dataset

Unnamed: 0,DATE,COMNAM,PERMNO,PERMCO
0,1986-01-07,OPTIMUM MANUFACTURING INC,10000,7952
1,1986-01-09,GREAT FALLS GAS CO,10001,7953
2,1993-11-22,ENERGY WEST INC,10001,7953
3,2009-08-04,ENERGY INC,10001,7953
4,2010-07-09,GAS NATURAL INC,10001,7953
...,...,...,...,...
49736,2013-04-10,VOLTARI CORP,93433,53451
49737,2010-06-14,S & W SEED CO,93434,53427
49738,2010-06-14,SINO CLEAN ENERGY INC,93435,53452
49739,2010-06-29,TESLA MOTORS INC,93436,53453


This file has a history of names for each `PERMNO`, as you should be able to observe. Since our data is relatively recent, drop duplicate values by `PERMNO`, retaining the **last** company name associated with each `permno` (HINT: you should use `drop_duplicates()` for this; see __[documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html)__ for how to use and how to retain the final observation).

In [104]:
names = names.drop_duplicates(subset=['PERMNO'], keep='last') # add code to drop duplicates
names

Unnamed: 0,DATE,COMNAM,PERMNO,PERMCO
0,1986-01-07,OPTIMUM MANUFACTURING INC,10000,7952
4,2010-07-09,GAS NATURAL INC,10001,7953
7,2002-05-15,BANCTRUST FINANCIAL GROUP INC,10002,7954
8,1986-01-14,GREAT COUNTRY BK ASONIA CT,10003,7957
9,1986-01-15,CLOSE OUTS PLUS INC,10004,7960
...,...,...,...,...
49734,2010-06-08,JIANGBO PHARMACEUTICALS INC,93432,53450
49736,2013-04-10,VOLTARI CORP,93433,53451
49737,2010-06-14,S & W SEED CO,93434,53427
49738,2010-06-14,SINO CLEAN ENERGY INC,93435,53452
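
Note that `keep='last'` retains the last row in file order, which is the most recent name only if the rows are sorted by date within each `PERMNO`. The file shown above appears to be sorted that way; the sketch below makes the assumption explicit by sorting first. The `toy_names` frame is a made-up stand-in that reuses a few names from the output above:

```python
import pandas as pd

# Toy name history; rows are deliberately out of date order
toy_names = pd.DataFrame({
    "DATE": ["2010-07-09", "1986-01-09", "1993-11-22", "1986-01-07"],
    "COMNAM": ["GAS NATURAL INC", "GREAT FALLS GAS CO", "ENERGY WEST INC",
               "OPTIMUM MANUFACTURING INC"],
    "PERMNO": [10001, 10001, 10001, 10000],
})
toy_names["DATE"] = pd.to_datetime(toy_names["DATE"])

# Sort so that "last" really means "most recent", then keep one row per PERMNO
latest = (toy_names.sort_values(["PERMNO", "DATE"])
                   .drop_duplicates(subset=["PERMNO"], keep="last"))
print(latest[["PERMNO", "COMNAM"]])
```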


Now, merge this data into your `sub` dataset (you only need to merge "COMNAM", but you're free to keep the other columns if you wish). I recommend using `pd.merge`. Call your new dataset `sub2`:

In [106]:
sub2 = pd.merge(sub, names[['PERMNO', 'COMNAM']], left_on='permno', right_on='PERMNO', how='left') # add code to merge
sub2

Unnamed: 0,patent_num,permno,issue_date,filing_date,xi_nominal,xi_real,cites,PERMNO,COMNAM
0,10172273,77520,2019-01-08,20170202.0,26.409036,10.030283,0,77520,A G C O CORP
1,10172275,14144,2019-01-08,20161229.0,9.621185,3.654174,2,14144,C N H INDUSTRIAL N V
2,10172291,14144,2019-01-08,20170116.0,9.621185,3.654174,0,14144,C N H INDUSTRIAL N V
3,10172277,14144,2019-01-08,20161006.0,9.621185,3.654174,6,14144,C N H INDUSTRIAL N V
4,10172280,14144,2019-01-08,20140429.0,9.621185,3.654174,0,14144,C N H INDUSTRIAL N V
...,...,...,...,...,...,...,...,...,...
6114,10194506,84788,2019-01-29,20180405.0,168.984771,64.181254,0,84788,AMAZON COM INC
6115,10194511,92469,2019-01-29,20170731.0,2.362142,0.897153,2,92469,ECHOSTAR CORP
6116,10194514,84381,2019-01-29,20151111.0,28.048238,10.652860,0,84381,ROCKWELL AUTOMATION INC
6117,10194482,59328,2019-01-29,20171212.0,22.343084,8.486014,0,59328,INTEL CORP
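
A left merge keeps every patent even when no company name is found, so it can be worth checking for unmatched rows and for accidental duplicates in the names table. The sketch below uses two optional `pd.merge` arguments, `validate` and `indicator`, on toy inputs (the `toy_sub` and `toy_names` values are illustrative, and permno 99999 is invented to show a non-match):

```python
import pandas as pd

# Toy inputs; permno 99999 has no match in the names table
toy_sub = pd.DataFrame({"patent_num": [1, 2, 3], "permno": [14144, 14144, 99999]})
toy_names = pd.DataFrame({"PERMNO": [14144, 77520],
                          "COMNAM": ["C N H INDUSTRIAL N V", "A G C O CORP"]})

merged = pd.merge(
    toy_sub, toy_names,
    left_on="permno", right_on="PERMNO",
    how="left",
    validate="many_to_one",  # raises MergeError if PERMNO repeats on the right
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), len(unmatched))
```

If `unmatched` is non-empty on the real data, those patents would have a missing `COMNAM` and would drop out of any group-by on company name.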


Now, report the 20 companies with the highest average value per patent. Use "xi_real" to measure value:

In [108]:
top20 = sub2.groupby('COMNAM')['xi_real'].mean() # average patent value by company
top20.sort_values(ascending=False).head(20)

COMNAM
NETFLIX INC                      364.899130
JPMORGAN CHASE & CO              285.497544
PEPSICO INC                      273.411966
BROADCOM INC                     272.633542
NVIDIA CORP                      215.235192
LILLY ELI & CO                   211.978281
BANK OF AMERICA CORP             208.387378
ABBVIE INC                       191.632884
LINDE PLC                        170.863245
STARBUCKS CORP                   151.072365
DISNEY WALT CO                   150.115539
LOWES COMPANIES INC              145.945266
ALIBABA GROUP HOLDING LTD        136.766018
EXXON MOBIL CORP                 129.484244
WALGREENS BOOTS ALLIANCE INC     127.495303
GILEAD SCIENCES INC              121.594688
PFIZER INC                       121.375057
CHEVRON CORP NEW                 120.859586
VERTEX PHARMACEUTICALS INC       108.523734
REGENERON PHARMACEUTICALS INC    102.237545
Name: xi_real, dtype: float64
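
An average can rest on very few patents, so it may help to report the number of patents next to each mean. This is a small sketch with made-up companies and values; the column names `mean_value` and `n_patents` are chosen here for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    "COMNAM": ["A", "A", "A", "B", "C", "C"],
    "xi_real": [1.0, 2.0, 3.0, 50.0, 10.0, 20.0],
})

# Mean value and patent count per company, highest mean first
summary = (toy.groupby("COMNAM")["xi_real"]
              .agg(mean_value="mean", n_patents="count")
              .sort_values("mean_value", ascending=False))
print(summary)
```

In the toy data, company B ranks first on the strength of a single patent, which is the kind of case the count column makes visible.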

### STEP 3: Use an API to acquire patent details
We're going to use a relatively new API to access the full details of the patent data. The endpoint for this API is `https://developer.uspto.gov/ibd-api/v1/application/grants`. You can review the syntax for the endpoint __[here](https://developer.uspto.gov/ibd-api/swagger-ui/index.html)__.

There are a variety of options for querying data, including simple keyword searches, abstract text, etc. Since we have the patent number in our dataset (`patent_num`), we're going to use the `patentNumber` parameter. For instance, the URL for searching for patent number 10172275 is `https://developer.uspto.gov/ibd-api/v1/application/grants?patentNumber=10172275`.

Acquire that one patent and inspect the results (NOTE: You should use `requests` for this, which I've set up for you below, with the keyword `verify` set to False. I've also included some code to suppress warnings.):

In [110]:
import requests
requests.packages.urllib3.disable_warnings()

address = 'https://developer.uspto.gov/ibd-api/v1/application/grants?patentNumber=10172275' # add string with address here
page = requests.get(address, verify=False)
print(page.status_code)
page.text

200


'{"results":[{"inventionSubjectMatterCategory":"utility","patentApplicationNumber":"US15394198","filingDate":"12-29-2016","mainCPCSymbolText":"A01B63/16","furtherCPCSymbolArrayText":["A01B49/06","A01B51/04","A01B63/22"],"inventorNameArrayText":["Totten Kip","Boriack Cale","Anderson Brian J.","Prickel Marvin A."],"abstractText":["In one embodiment, an agricultural implement system includes a pivotable lift assembly. The pivotable lift assembly includes a first bar member and a second bar member rotatively coupled to the first bar member. The pivotable lift assembly further includes a first wheel assembly disposed on a first end of the second bar member and a second wheel assembly disposed on a second end of the second bar member. The pivotable lift assembly also includes an attachment assembly configured to attach the pivotable lift assembly to an agricultural implement, wherein the pivotable lift assembly is configured to aid in carrying a weight of the agricultural implement."],"assig

For our analysis, we need "abstractText", which you should be able to see in the raw JSON file (feel free to use ctrl-F to find). Determine how to access this field and print the results below:

In [112]:
import json
### Add your results here:
record = json.loads(page.text) # loads = load from string
#print(record)
print(record['results'][0]['abstractText'])

['In one embodiment, an agricultural implement system includes a pivotable lift assembly. The pivotable lift assembly includes a first bar member and a second bar member rotatively coupled to the first bar member. The pivotable lift assembly further includes a first wheel assembly disposed on a first end of the second bar member and a second wheel assembly disposed on a second end of the second bar member. The pivotable lift assembly also includes an attachment assembly configured to attach the pivotable lift assembly to an agricultural implement, wherein the pivotable lift assembly is configured to aid in carrying a weight of the agricultural implement.']
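
As the printed result shows, `abstractText` comes back as a list containing one string rather than a bare string. A small helper can flatten it so that later text processing receives plain text. The function name `abstract_to_text` is hypothetical, and the `raw` string is a trimmed-down stand-in for the real response that keeps only the fields used here:

```python
import json

def abstract_to_text(record):
    """Return the abstract of one result record as a single string.

    Assumes 'abstractText' is a list of strings, as in the response above;
    a missing field yields an empty string.
    """
    parts = record.get("abstractText") or []
    if isinstance(parts, str):  # tolerate a plain string as well
        parts = [parts]
    return " ".join(p.strip() for p in parts)

# Stand-in for the API response (not the full payload)
raw = '{"results":[{"patentNumber":"10172275","abstractText":["In one embodiment, an agricultural implement system includes a pivotable lift assembly."]}]}'
record = json.loads(raw)["results"][0]
print(abstract_to_text(record))
```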


In a moment, we're going to collect all of the patent data, but the documentation suggests you can collect more than 1 patent at a time. Instead of making thousands of individual queries, let's obtain the data in blocks of 100 patents. To do this, we should:
1. Create a list of lists, where each inner list holds up to 100 patent numbers
2. Use `",".join()` to convert each list to a string when generating the webquery

There are a number of ways you could go about generating the list of lists. I'll save you some time and provide you with one line of code that takes care of this with list comprehension and `range()`:

In [114]:
# creates list of lists, where inner elements are each 100 patents (except last one)
patents = [sub.iloc[i:i+100]['patent_num'].astype(str).values.tolist() for i in range(0, len(sub), 100)]
len(patents)

62

So we have 62 groups to search, or 62 separate queries.
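
The count of 62 follows from the 6,118 rows in the January 2019 subset: 61 full groups of 100 plus one final group of 18. The sketch below reproduces that arithmetic with a generic helper; the name `chunk` and the stand-in numbers are illustrative only:

```python
import math

def chunk(items, size):
    """Split a list into consecutive sublists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# 6,118 stand-in patent numbers, matching the number of rows in the subset above
fake_numbers = [str(n) for n in range(6118)]
groups = chunk(fake_numbers, 100)

print(len(groups), len(groups[-1]))  # number of groups, size of the last one
print(",".join(groups[0][:3]))       # how part of a group becomes a query string
```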

To finish up **Step 3**, fill in the code below to collect the data. I've set up a list to collect the data:

In [116]:
import time

records = [] # will use to collect the data

for i,patent_group in enumerate(patents):
    print(i+1) # counter to monitor status
    # Create a comma-separated string with the contents of patent_group
    pat_group = ','.join(patent_group) # add answer here
    
    # set up the URL you will use to access the data (I encourage you to use an f-string)
    url = f"https://developer.uspto.gov/ibd-api/v1/application/grants?patentNumber={pat_group}" # add answer here
    
    # use requests to get the URL
    page = requests.get(url,verify=False)
    
    # parse with JSON:
    response = page.json() # add answer here
    
    for record in response['results']:
        patent_number = record['patentNumber']
        abstract = record['abstractText'] # add how you will access the text of the abstract
        records.append({'patent_number':patent_number,
                        'abstract':abstract
                       })
        
    time.sleep(1) # Give the USPTO a short break (1 second)      
    

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
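
The loop above assumes that every request succeeds and that every requested patent comes back. One way to check the second assumption is to compare the requested numbers against what was collected. This is a sketch with made-up inputs; `missing_patents` is a hypothetical helper, not part of the API:

```python
def missing_patents(requested, records):
    """Return requested patent numbers that do not appear in the collected records."""
    returned = {str(r["patent_number"]) for r in records}
    return [p for p in requested if str(p) not in returned]

# Toy example: one requested patent was not returned
requested = ["10172273", "10172275", "10172291"]
collected = [{"patent_number": "10172273", "abstract": ["..."]},
             {"patent_number": "10172275", "abstract": ["..."]}]
print(missing_patents(requested, collected))
```

On the real data the call would look like `missing_patents(sub['patent_num'], records)`; an empty list would indicate that nothing was dropped.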


Now create a dataframe from the data you just collected:

In [118]:
patent_abstracts = pd.DataFrame(records) # add code here to create DataFrame from "records"
patent_abstracts

Unnamed: 0,patent_number,abstract
0,10172273,[A vehicle control system for controlling the ...
1,10172274,[A vehicle platform comprises a central body t...
2,10172275,"[In one embodiment, an agricultural implement ..."
3,10172276,[An implement frame having a carriage frame fo...
4,10172277,[A port interface for a pneumatic distribution...
...,...,...
6114,10194511,"[Systems, methods, apparatus, and machine-read..."
6115,10194514,[Electrostatic charge grounding is achieved by...
6116,10194518,"[This disclosure describes systems, methods, a..."
6117,10194522,[A method comprises applying an adhesive to a ...


Finally, you can save this data so you don't have to re-run the collection step each time you work on this task:

In [120]:
patent_abstracts.to_csv("./patent_abstracts.csv",sep="^") # using ^ as delimiter to help keep text clean
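
Because each abstract is stored as a one-element list, writing it to CSV saves the list's printed form (brackets and quotes included), and saving the index adds an extra unnamed column on reload. A possible alternative, sketched here on toy records with an in-memory buffer standing in for the file, is to flatten the lists and pass `index=False`:

```python
import io
import pandas as pd

# Toy records shaped like those collected above (text is made up)
toy_records = [
    {"patent_number": "10172273", "abstract": ["A vehicle control system, with commas."]},
    {"patent_number": "10172275", "abstract": ["In one embodiment, an implement."]},
]
toy_df = pd.DataFrame(toy_records)

# Flatten each one-element list into a plain string before saving
toy_df["abstract"] = toy_df["abstract"].str.join(" ")

buffer = io.StringIO()                       # stands in for a file on disk
toy_df.to_csv(buffer, sep="^", index=False)  # index=False avoids an extra unnamed column
buffer.seek(0)
restored = pd.read_csv(buffer, sep="^", dtype={"patent_number": str})
print(restored)
```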

### STEP 4: Generate Document Term Matrix

In [122]:
# Load your data if needed:
patent_abstracts = pd.read_csv("./patent_abstracts.csv",sep="^")
patent_abstracts['abstract'] = patent_abstracts['abstract'].str.replace(r"^\['|'\]$", '', regex=True)
patent_abstracts

Unnamed: 0.1,Unnamed: 0,patent_number,abstract
0,0,10172273,A vehicle control system for controlling the h...
1,1,10172274,A vehicle platform comprises a central body th...
2,2,10172275,"In one embodiment, an agricultural implement s..."
3,3,10172276,An implement frame having a carriage frame for...
4,4,10172277,A port interface for a pneumatic distribution ...
...,...,...,...
6114,6114,10194511,"Systems, methods, apparatus, and machine-reada..."
6115,6115,10194514,Electrostatic charge grounding is achieved by ...
6116,6116,10194518,"This disclosure describes systems, methods, an..."
6117,6117,10194522,A method comprises applying an adhesive to a f...


Here are your instructions again for how to preprocess the data and generate the DTM. You should use `CountVectorizer`.

<i>Use the following criteria for preprocessing and tokenizing your data:</i>
- <i>Include only those tokens that are all letters (alpha) and use lowercase for everything</i>
- <i>Allow for single words and bigrams</i>
- <i>Require tokens be at least 3 characters long</i>
- <i>Exclude stop words (Use NLTK for stopwords)</i>
- <i>Restrict the matrix to the 1,000 most common words</i>
    
Use `vec` as the name for your vectorizer and `dtm` as the name for your document-term matrix. Other than that, you may choose to carry out this process by any method you wish. 

In [165]:
from nltk.tokenize import word_tokenize # requires the NLTK Punkt tokenizer data

# Pre-process to filter tokens
def preprocess_text(text):
    tokens = word_tokenize(text)
    tokens = [word.lower() for word in tokens if word.isalpha()] # keep all-letter tokens, lowercased
    tokens = [word for word in tokens if len(word) >= 3] # keep tokens of at least 3 characters
    return ' '.join(tokens)

patent_abstracts['clean_abstract'] = patent_abstracts['abstract'].apply(preprocess_text)

In [189]:
import numpy as np
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

# Create DTM: single words and bigrams, stop words removed, 1,000 most frequent terms
# (stop_words='english' uses scikit-learn's built-in list rather than NLTK's)
vec = CountVectorizer(ngram_range=(1, 2), lowercase=True, stop_words='english', max_features=1000)
dtm = vec.fit_transform(patent_abstracts['clean_abstract'])
dtm_df = pd.DataFrame(dtm.todense(), columns=vec.get_feature_names_out())
dtm_df.head()

Unnamed: 0,access,accordance,according,account,acid,acquisition,action,actions,active,activity,...,wellbore,width,window,wire,wireless,wireless communication,wireless device,work,write,zone
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [169]:
print(vec.get_feature_names_out()[:200]) # list corresponding to columns in dataframe
print(len(vec.get_feature_names_out()))

['access' 'accordance' 'according' 'account' 'acid' 'acquisition' 'action'
 'actions' 'active' 'activity' 'actual' 'actuator' 'adapted' 'addition'
 'additional' 'additionally' 'address' 'adhesive' 'adjacent' 'adjust'
 'adjusted' 'adjusting' 'adjustment' 'agent' 'air' 'aircraft' 'allocation'
 'allow' 'allowing' 'allows' 'amplifier' 'analog' 'analysis' 'analyzing'
 'angle' 'antenna' 'aperture' 'apparatus' 'apparatus includes'
 'apparatuses' 'application' 'applications' 'applied' 'applying' 'area'
 'areas' 'arm' 'arranged' 'arrangement' 'array' 'article' 'aspect'
 'aspects' 'assembly' 'assembly includes' 'asset' 'assigned' 'associated'
 'attached' 'attachment' 'attributes' 'audio' 'authentication'
 'automatically' 'available' 'axis' 'band' 'bandwidth' 'barrier' 'base'
 'base station' 'based' 'basis' 'battery' 'beam' 'bearing' 'behavior'
 'bias' 'bit' 'bits' 'blade' 'block' 'blocks' 'board' 'body' 'box'
 'broadcast' 'buffer' 'bus' 'cable' 'cache' 'calculated' 'calibration'
 'camera' 'candi

Now report the 25 most frequently used words:

In [172]:
#the following are the 25 most frequently used words
dtm_total = dtm_df.sum(axis=0)
dtm_total.sort_values(ascending=False).head(25)

second         5697
device         4828
includes       4300
data           3938
method         2907
plurality      2322
based          2318
configured     2275
signal         2045
user           1972
portion        1949
information    1925
layer          1642
control        1596
network        1574
provided       1555
include        1493
image          1479
having         1453
surface        1437
unit           1378
including      1305
power          1252
associated     1223
methods        1204
dtype: int64

### STEP 5: Which words correlate with value?
The final part of the assignment requires you to correlate each word with value, as measured by `xi_real`. Then, report the 10 words that correlate most positively and most negatively (20 total). I recommend taking the following approach:
1. Write a function that correlates two arrays (or series) and returns the correlation
2. Loop over each word in the vocabulary and store the correlation coefficient in the container of your choosing
3. Create a pandas Series with the final results, where the index of the series is the word.
4. Sort the series in ascending order, and examine the `head` (most negative) and `tail` (most positive)

In [175]:
dtm2 = dtm_df.divide(dtm_total,axis=1)
dtm2

Unnamed: 0,access,accordance,according,account,acid,acquisition,action,actions,active,activity,...,wellbore,width,window,wire,wireless,wireless communication,wireless device,work,write,zone
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6114,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6115,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6116,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6117,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


First, come up with your function to return a correlation between the column of word counts and patent values. Test the function on a sample word.

In [178]:
def mycorr(a,b):
    # Correlation matrix for two aligned series (e.g., a column of dtm2 and sub2['xi_real']);
    # the off-diagonal entry is their Pearson correlation
    corrs = pd.concat([a,b],axis=1).corr()
    return corrs

# Test your function on a sample word ('acid' is the fifth column of dtm2)
mycorr(dtm2[dtm2.columns[4]], sub2['xi_real'])

Unnamed: 0,acid,xi_real
acid,1.0,0.013854
xi_real,0.013854,1.0
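
One caution: `pd.concat(..., axis=1)` pairs rows by index label, not by patent number. In the outputs displayed earlier, row 1 of `sub2` is patent 10172275 while row 1 of `patent_abstracts` is patent 10172274, so the two tables may not be in the same order. A possible safeguard is to merge on the patent number before correlating. The sketch below uses made-up values, and the `key` column is introduced here only for illustration:

```python
import pandas as pd

# Toy stand-ins: same patents, different row order
toy_values = pd.DataFrame({"patent_num": [10172273, 10172275, 10172274],
                           "xi_real": [10.0, 3.0, 7.0]})
toy_counts = pd.DataFrame({"patent_number": ["10172273", "10172274", "10172275"],
                           "layer": [0, 2, 1]})

# Use a common string key, then merge so each row pairs counts with the right value
toy_values["key"] = toy_values["patent_num"].astype(str)
toy_counts["key"] = toy_counts["patent_number"].astype(str)
aligned = toy_counts.merge(toy_values[["key", "xi_real"]], on="key", how="inner")

print(aligned[["key", "layer", "xi_real"]])
print(aligned["layer"].corr(aligned["xi_real"]))
```

With the real data, the same idea would attach `xi_real` to the abstracts table (or the patent number to `dtm_df`) before any correlation is computed.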


Now complete the rest of  the task using whatever method you prefer (**HINT**: I recommend looping over words in the vocabulary and collecting correlations):

In [181]:
corrs = []
for word in dtm2.columns:
    # correlation between this word's column and patent value (off-diagonal entry of the 2x2 matrix)
    corrs.append(pd.concat([dtm2[word],sub2['xi_real']],axis=1).corr().values[0,1])
corrs_series = pd.Series(corrs,index=dtm2.columns)
corrs_series

access                    0.007964
accordance               -0.001560
according                 0.022472
account                   0.020560
acid                      0.013854
                            ...   
wireless communication    0.006101
wireless device          -0.017110
work                     -0.007918
write                    -0.009794
zone                      0.012009
Length: 1000, dtype: float64
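
The loop over words can also be written as a single call to `DataFrame.corrwith`, which computes the Pearson correlation of every column with a given series. This is a sketch on a made-up two-word matrix (`toy_dtm` and `toy_y` are illustrative and share the same row index):

```python
import pandas as pd

# Toy document-term counts and values, sharing the same row index
toy_dtm = pd.DataFrame({"layer": [0, 2, 1, 3], "card": [1, 0, 2, 0]})
toy_y = pd.Series([10.0, 7.0, 12.0, 4.0], name="xi_real")

# Pearson correlation of every column with toy_y, in one call
toy_corrs = toy_dtm.corrwith(toy_y).sort_values()
print(toy_corrs)
```

After sorting in ascending order, `head` gives the most negative correlations and `tail` the most positive.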

What words exhibit the most negative correlations?

In [184]:
#the following are the most negative correlations
corrs_series.sort_values(ascending=False).tail(10)

disposed        -0.030405
semiconductor   -0.031714
dielectric      -0.032494
coupled         -0.033243
gate            -0.034767
formed          -0.038043
second          -0.039260
forming         -0.040417
layer           -0.043806
substrate       -0.045291
dtype: float64

What words exhibit the most positive correlations?

In [187]:
#the following are the most positive correlations
corrs_series.sort_values(ascending=False).head(10)

card            0.079100
container       0.071155
screen          0.066077
managing        0.066072
point           0.063102
selected        0.050564
particularly    0.049771
methods         0.048988
destination     0.047501
transaction     0.047142
dtype: float64

**FINAL QUESTION**
From this simple analysis, does it appear that the text in the abstract correlates much with value? Do any correlations stand out as particularly intuitive or surprising? Provide a brief response below:

**ANSWER:**
*The text in the abstract does not appear to correlate much with value: every correlation reported above is below 0.08 in absolute value. The most negatively correlated words (e.g., "substrate", "layer", "dielectric", "semiconductor", "gate") are associated with chemicals and electricity, which is interesting. The most positively correlated words show no clustering that is obvious to the naked eye.*