# Scraping AJAX 
A practice exercise in scraping data from an AJAX-driven web page: https://corp.sos.ms.gov/corp/portal/c/page/corpBusinessIdSearch/portal.aspx?#clear=1

## AJAX request
1. Open the page above in Chrome, then open View -> Developer -> Developer Tools.
2. In Developer Tools, click the Network tab.
3. Search for something simple on the original page (for example, "a").
4. A request named "BusinessNameSearch" appears in Developer Tools. Click it.
5. The General section shows the Request URL. This is the URL we will use later with Python requests.
6. The Request Payload section shows the request format, which we will reproduce with Python requests.
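
As a rough sketch of what step 6 amounts to, the payload seen in Developer Tools can be written as a Python dict and serialised with the standard `json` module for a side-by-side comparison. The field names `BusinessName` and `SearchType` and the value `"startingwith"` come from the request used later in this document; the helper name `build_payload` is made up for illustration.

```python
import json


def build_payload(name, search_type="startingwith"):
    """Return the JSON body for a BusinessNameSearch request (hypothetical helper)."""
    return {"BusinessName": name, "SearchType": search_type}


payload = build_payload("a")
# Compare this string with the Request Payload shown in Developer Tools.
print(json.dumps(payload))
```

Passing the same dict as `json=payload` to `requests.post` sends this data as a JSON request body.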

## Simple Version for Diagnostics
This is the simplest possible code, meant only to check whether we get a response at all.

In [1]:
import requests

temp = "a"  # search term
r = requests.post(
    url='https://corp.sos.ms.gov/corp/Services/MS/CorpServices.asmx/BusinessNameSearch',
    json={
        'BusinessName': temp,
        'SearchType': "startingwith",  # match names that start with the term
    },
)

If everything works, we can run
```python
r.json()
```
and see the list of search results.

Next, we convert the result from a string to a pandas DataFrame.

In [146]:
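import ast  # literal_eval safely parses Python literals
import pandas as pd

res_str = r.json()["d"]  # the records arrive as a string under the "d" key

# Drop the first 11 characters and the last character, which wrap the list.
res = ast.literal_eval(res_str[11:-1])  # list of dict

df = pd.DataFrame(res)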
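
The fixed slice `[11:-1]` depends on the exact length of the wrapper around the records. A slightly more tolerant sketch, assuming only that the string contains a single bracketed list of dicts, locates the outermost `[` and `]` instead. The helper name `parse_records` and the sample string are made up for illustration and are not the service's real output.

```python
import ast


def parse_records(res_str):
    """Extract a list of dicts from a wrapped string (hypothetical helper)."""
    start = res_str.find("[")
    end = res_str.rfind("]")
    if start == -1 or end < start:
        return []  # no list found, e.g. an empty result such as '""'
    return ast.literal_eval(res_str[start:end + 1])


# Made-up sample with an 11-character wrapper, so [11:-1] would give the same list.
sample = "{'Table1': [{'BusinessName': 'A CO', 'BusinessId': 1}]}"
records = parse_records(sample)
print(records[0]["BusinessName"])  # A CO
```

The returned list can be passed to `pd.DataFrame` exactly as `res` is above.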

## Looping over starting letters A-Z

In [None]:
import ast

import pandas as pd
import requests

URL = 'https://corp.sos.ms.gov/corp/Services/MS/CorpServices.asmx/BusinessNameSearch'

# Columns expected in the service's records.
COLUMNS = ['BusinessFormedDate', 'BusinessId', 'BusinessName', 'EntityId',
           'FilingId', 'FilingStatus', 'FilingTypeId', 'FilingtypeName',
           'NameType']

frames = []  # one DataFrame per starting letter
for i in range(ord('a'), ord('z') + 1):
    temp = chr(i)  # current starting letter

    r = requests.post(
        url=URL,
        json={
            'BusinessName': temp,
            'SearchType': "startingwith",
        },
    )

    res_str = r.json()["d"]
    res = ast.literal_eval(res_str[11:-1])  # list of dict
    frames.append(pd.DataFrame(res, columns=COLUMNS))

# DataFrame.append was removed in pandas 2.0, so combine the pieces with concat.
output = pd.concat(frames, ignore_index=True)
output.to_csv("MS_Business_info.csv", index=False)

## References
http://toddhayton.com/2015/03/11/scraping-ajax-pages-with-python/

## Issues
1. The search supports neither regular expressions nor spaces. You need to type some letters to search.
2. Each request returns at most 2000 results.
3. The search results include no further details about each business.

## Recursive
A recursive function that searches for all names starting with a given prefix.
If a query returns 2000 rows (the cap), the result is probably truncated, so the current string is used as the prefix for a further round of longer queries.

In [None]:
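import ast

import pandas as pd
import requests

URL = 'https://corp.sos.ms.gov/corp/Services/MS/CorpServices.asmx/BusinessNameSearch'
COLUMNS = ['BusinessFormedDate', 'BusinessId', 'BusinessName', 'EntityId',
           'FilingId', 'FilingStatus', 'FilingTypeId', 'FilingtypeName',
           'NameType']
MAX_ROWS = 2000  # the service returns at most this many rows per request


class scrap:
    def __init__(self):
        self.frames = []  # one DataFrame per uncapped query
        # Search alphabet: a-z followed by 0-9.
        self.letters = list(map(chr, range(ord("a"), ord("z") + 1))) + list(map(str, range(10)))

    def foo(self, pre):
        """Query every one-character extension of `pre`, recursing on capped results."""
        for letter in self.letters:
            inpt = pre + letter
            print(inpt)  # progress indicator
            r = requests.post(
                url=URL,
                json={
                    'BusinessName': inpt,
                    'SearchType': "startingwith",
                },
            )
            res_str = r.json()["d"]
            if res_str != '""':  # skip empty results
                res = ast.literal_eval(res_str[11:-1])
                df = pd.DataFrame(res)
                if df.shape[0] == MAX_ROWS:
                    # Capped, so probably truncated: refine with a longer prefix.
                    self.foo(inpt)
                else:
                    self.frames.append(df)

    def res(self):
        """Return all collected rows as one DataFrame."""
        if not self.frames:
            return pd.DataFrame(columns=COLUMNS)
        # DataFrame.append was removed in pandas 2.0; concat replaces it.
        return pd.concat(self.frames, ignore_index=True)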
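
The refinement logic can also be sketched without the network, which makes it easier to test. This is a minimal sketch rather than the implementation above: `collect`, `fake_search`, and the tiny name list are hypothetical, and the cap is lowered to 3 so that the recursion actually triggers. As in the class above, a capped response is treated as truncated and its rows are discarded in favour of longer prefixes.

```python
import string

ALPHABET = string.ascii_lowercase + string.digits  # same symbols as self.letters


def collect(prefix, search, cap=2000, alphabet=ALPHABET):
    """Gather rows for every extension of `prefix`, refining capped queries."""
    rows = []
    for symbol in alphabet:
        query = prefix + symbol
        found = search(query)
        if len(found) >= cap:
            # Possibly truncated: ask again with longer prefixes.
            rows.extend(collect(query, search, cap, alphabet))
        else:
            rows.extend(found)
    return rows


NAMES = ["aa1", "aab", "aac", "aad", "ab", "b1"]  # made-up data


def fake_search(query, cap=3):
    """Stand-in for the web service: at most `cap` names starting with `query`."""
    return [n for n in NAMES if n.startswith(query)][:cap]


result = collect("", fake_search, cap=3)
print(sorted(result))
```

One limitation of this scheme is worth keeping in mind: a name that is exactly equal to a capped prefix, or that continues with a character outside the alphabet (such as a space), is never returned by any of the longer queries. Keeping the capped rows as well and removing duplicates afterwards is one possible workaround.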

### Example

In [None]:
a = scrap()
a.foo("am")
df = a.res()
df.shape