#### The Two Files and Their Fields

You have two files, one containing review data, the other containing product data.
The format of these files is unusual:  each file has one line per "data row," and a data row is either a single review or a single product, depending on the file.

Each line can be converted to a Python dictionary using this code:

``eval('(' + line + ')')``

By iterating over all of the lines in the review data file and evaluating each line, you can build a list of dictionaries, one per review.  This is roughly equivalent to scraping the review pages, and the same approach works for the product data file.  You will use the data in these two lists of dictionaries to index your two SOLR collections.
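
The loop described above might look like the sketch below. It swaps `eval` for `ast.literal_eval`, which only accepts Python literals and so will not execute arbitrary code found in a data file. This assumes every line is a plain dictionary literal; if your files contain anything `literal_eval` rejects, fall back to the `eval` form shown above. The function name `read_rows` and the UTF-8 encoding are assumptions, not part of the assignment.

```python
import ast

def read_rows(filename):
    """Return one dictionary per non-blank line of the data file."""
    rows = []
    # The encoding is an assumption; change it if your files differ.
    with open(filename, "r", encoding="utf-8") as data_file:
        for line in data_file:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # literal_eval raises an error on anything that is not a literal,
            # instead of running it as code.
            rows.append(ast.literal_eval(line))
    return rows
```

Calling `read_rows` once per file gives the two lists of raw dictionaries, which still need to be filtered and cleaned before indexing.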

One difference is that when you scrape a page you can select only the attributes that are interesting to your application.  When you read these files in, they will have some extra fields, which you will have to remove from your dictionary (because those fields will not be in your SOLR schema, and SOLR will complain if you send it fields that are not part of its schema).
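
A minimal sketch of dropping the extra fields is shown below. The helper name `keep_fields` is hypothetical, and the extra keys in the sample row (`reviewerID`, `helpful`) are made up for illustration; your files may contain different extras.

```python
# Field names taken from the reviews table in this notebook ("id" is added by SOLR).
REVIEW_FIELDS = ("asin", "reviewText", "overall", "summary")

def keep_fields(row, fields):
    """Return a copy of row containing only the named fields that are present."""
    return {name: row[name] for name in fields if name in row}

# Made-up sample row with two extra keys that are not in the schema.
raw = {"asin": "B001", "reviewText": "Works well", "overall": 5.0,
       "summary": "Good", "reviewerID": "A1", "helpful": [0, 0]}
cleaned = keep_fields(raw, REVIEW_FIELDS)
print(cleaned)
```

Because missing keys are simply skipped, the result can contain fewer fields than the schema, which matters for the rules about empty attributes.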

Not every field in the files is used on our web site.  The tables below explain what the relevant fields mean; keep those and discard (ignore) all others.

##### Reviews Data

Your solution will use only the following fields; ignore any others in the data.

| Field | Type | Note |
|-------|------|------|
| id | UUID | Not in the input file; supplied by SOLR, as in Assignment 1 |
| asin | string | Product ID.  Joins with the asin field in the product file |
| reviewText | text | Full review body |
| overall | integer | Average rating.  Truncate the floating-point value |
| summary | text | Review summary text |

Unlike the last assignment, we are omitting the review time from our reviews documents.
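
The truncation mentioned in the *overall* row of the table can be sketched as follows. The helper name `to_overall` and the choice of `None` as the "not usable" result are assumptions for illustration; the 1 to 5 range is the one this assignment requires.

```python
def to_overall(value):
    """Return the rating as an int from 1 to 5, or None if it is not usable."""
    try:
        # float() accepts numbers and numeric strings; int() then truncates,
        # so 4.0 becomes 4 and "4.9" also becomes 4.
        rating = int(float(value))
    except (TypeError, ValueError, OverflowError):
        return None  # missing, non-numeric, NaN, or infinite
    return rating if 1 <= rating <= 5 else None
```

Going through `float` first matters when the value arrives as a string such as `"5.0"`, which `int` alone would reject.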

##### Products
 
Your solution will use only the following fields; ignore any others in the data.

| Field | Type | Note |
|-------|------|------|
| asin | string | Unique ID for products.  Joins with the asin field in the reviews file. |
| description | string | Stored but not indexed.  Shown on product detail page but not searchable. |
| title | string | This is the product name.  Stored but not indexed.  Shown on the product detail page and also on review search result and detail pages. |
| price | float | Displayed in currency format on the product detail page.|

Notice that for product data, only the *asin* field is indexed.  That means you can't do any kind of search on products except an ASIN lookup.  Of course, in a real e-commerce application you would want to search on product attributes too, but for this assignment we are keeping it simple and only searching on reviews.

Be aware of the implications of this decision -- if for example you do a review search on 'iphone' you will only find reviews that explicitly contain the term 'iphone' in the review summary or the review text, even though the product itself might have 'iphone' in the title.
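
The toy example below illustrates that consequence in plain Python. It is only a stand-in: SOLR matches analyzed tokens rather than raw substrings, and all of the sample data is made up. It also shows the ASIN lookup used to display a product title next to a review hit.

```python
# Made-up sample data: one product and two reviews of it.
products = {"B001": {"asin": "B001", "title": "iPhone 4 case"}}
reviews = [
    {"asin": "B001", "summary": "Fits well", "reviewText": "Snug and sturdy."},
    {"asin": "B001", "summary": "Great iPhone case", "reviewText": "Love it."},
]

def review_matches(review, term):
    """Case-insensitive substring check over the two searchable review fields."""
    text = (review["summary"] + " " + review["reviewText"]).lower()
    return term.lower() in text

# Only the second review is found, although both belong to an iPhone product.
hits = [review for review in reviews if review_matches(review, "iphone")]

# The product title is fetched by ASIN lookup, never by searching products.
for review in hits:
    print(products[review["asin"]]["title"], "|", review["summary"])
```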

### The Code and How it Will be Used

The main deliverables for this assignment are:
* A directory containing the configuration for your *reviews* collection
* A directory containing the configuration for your *products* collection
* A directory containing your Flask project

Please be careful to name things exactly like this:
* The directory containing the configuration for your reviews collection will be named *reviews*
* The SOLR collection will also be named *reviews*
* The directory containing the configuration for your products collection will be named *products*
* The SOLR collection will also be named *products*
* The directory containing your Flask project will be named *reviewsite*

This notebook will contain some code to "scrape" the data files and prepare dictionaries for indexing.  Please note that in this notebook you will write only the function definitions that prepare the dictionaries, similar to the function *scrapePagesForReviews* you wrote for Assignment 1.

Also note that in your code and in your SOLR schemas you must use the exact names in the documentation above.  For example, your *reviews* schema must have fields named *id*, *asin*, *reviewText*, *overall*, and *summary*.  Be careful about capitalization.  The  *products* schema must have fields named *asin*, *description*, *title*, and *price*.  

When you "scrape" products and reviews you must also apply the following rules, which filter out records with bad data.  In the rules below, "empty" means either that the value of the attribute is an empty string or that the attribute is missing altogether.

For products:
* Do not include a product if its asin or title fields are empty.  An empty description is OK.
* It is OK to have an empty price, but you need to check that the string value you read is a valid positive floating-point number.  If it is not, omit the attribute.

For reviews:
* Do not include a review if its asin, summary, or reviewText attributes are empty.
* It is OK to have an empty *overall* attribute, but you need to check that the string value you read can be converted to an integer between 1 and 5.  If it cannot, omit the attribute.
* Do not include a review if there is not a product with the same asin.
   * This last restriction is tricky -- give some thought to its implementation!
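
One possible approach to the last restriction is sketched below: process the products first, collect their ASINs in a set, and then keep only the reviews whose ASIN is in that set. The function name `filter_reviews_by_product` is hypothetical. Because *reviewJSON* is called with only a filename, a real solution also needs some way to reach the product ASINs from inside it, for example a module-level collection filled in by *productJSON*.

```python
def filter_reviews_by_product(reviews, products):
    """Keep only the reviews whose asin appears among the given products."""
    # A set gives constant-time membership tests, which matters when the
    # review file is much larger than the product file.
    known_asins = {product["asin"] for product in products}
    return [review for review in reviews if review.get("asin") in known_asins]
```

The products passed in should already have been filtered by the product rules, so that a review of a rejected product is rejected too.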

In [150]:
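##############################################################
#  These functions read the data files and produce lists of dictionaries, which
#  will be passed to SOLR for indexing.   So the key names in the dictionaries
#  must correspond to SOLR fields, and the values must be of the correct type
#  according to the SOLR schema.

import os


PRODUCTFILE = "test-products.txt"
REVIEWFILE = "test-reviews.txt"
# ASINs of the products accepted by productJSON.  reviewJSON uses this to drop
# reviews that have no matching product, so productJSON must run first.
productASIN = set()

def getPrice(proDic):
    # Return the price as a positive float, or None if it is missing or invalid.
    try:
        price = float(proDic.get('price'))
    except (TypeError, ValueError):
        return None
    return price if price > 0.0 else None

def getOverall(reviewDic):
    # Return the rating truncated to an integer from 1 to 5,
    # or 0 if it is missing or invalid.
    try:
        overall = int(float(reviewDic.get('overall')))
    except (TypeError, ValueError, OverflowError):
        return 0
    return overall if 1 <= overall <= 5 else 0

def productJSON(filename):
    products = []
    productASIN.clear()   # start fresh so repeated calls do not accumulate ASINs
    with open(filename, "r") as proFile:
        for line in proFile:
            if not line.strip():
                continue   # skip blank lines
            product = eval('(' + line + ')')
            proDic = {'asin': product.get('asin', ''),
                      'title': product.get('title', ''),
                      'description': product.get('description', '')}
            # Skip products with an empty asin or title.
            if proDic['asin'] == '' or proDic['title'] == '':
                continue
            # A missing or invalid price is omitted; the product is still kept.
            price = getPrice(product)
            if price is not None:
                proDic['price'] = price
            products.append(proDic)
            productASIN.add(proDic['asin'])
    return products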

In [151]:
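def reviewJSON(filename):
    reviews = []
    with open(filename, "r") as revFile:
        for line in revFile:
            if not line.strip():
                continue   # skip blank lines
            review = eval('(' + line + ')')
            reviewDic = {'asin': review.get('asin', ''),
                         'reviewText': review.get('reviewText', ''),
                         'summary': review.get('summary', '')}
            # Skip reviews with an empty asin, summary, or reviewText,
            # and reviews whose product was not accepted by productJSON.
            if reviewDic['asin'] == '' or \
                reviewDic['summary'] == '' or \
                reviewDic['reviewText'] == '' or \
                reviewDic['asin'] not in productASIN:
                continue
            # getOverall returns 0 for a missing or invalid rating;
            # in that case the attribute is omitted and the review is kept.
            overall = getOverall(review)
            if overall > 0:
                reviewDic['overall'] = overall
            reviews.append(reviewDic)
    return reviews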

In [137]:
import subprocess


SOLR_EXECUTABLE = 'C:\\solr-8.11.1\\solr-8.11.1\\bin\\solr.cmd'
SOLR_PRODUCTS = 'http://localhost:8983/solr/products'
SOLR_REVIEWS = 'http://localhost:8983/solr/reviews'

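# Configuration directories for the two cores, relative to the notebook's directory.
config_products_loc = os.path.join(os.getcwd(), 'products', 'conf')
config_reviews_loc = os.path.join(os.getcwd(), 'reviews', 'conf')

def solr_command(*args):
    # Run the solr command-line tool and return its output as bytes.
    return subprocess.check_output([SOLR_EXECUTABLE] + list(args))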


solr_command('create_core', '-c', 'products', '-d', config_products_loc)
solr_command('create_core', '-c', 'reviews', '-d', config_reviews_loc)


b"\nCreated new core 'products'\r\n"

In [139]:
import pysolr
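# Products must be loaded first: reviewJSON relies on the ASINs they provide.
prod = productJSON(PRODUCTFILE)
pysolr.Solr(SOLR_PRODUCTS).add(prod, commit=True)


# Named rev rather than re, to avoid shadowing the standard-library re module.
rev = reviewJSON(REVIEWFILE)
pysolr.Solr(SOLR_REVIEWS).add(rev, commit=True)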

'{\n  "responseHeader":{\n    "status":0,\n    "QTime":3042}}\n'

### How I Will Run Your Solution

I will start in a directory that contains:
* This notebook (the notebook you hand in)
* Your two configuration directories you hand in, which must be named *products* and *reviews*
* The directory you hand in containing your Flask application, which must be named *reviewsite*
* My test data files, which might be named *testProductData.txt* and *testReviewData.txt*

SOLR will be running and have no collections defined.

I will do the following:
* Create the *products* and *reviews* collections using the directories you provided.  For example:
<pre>
solr create_collection -c products -d products
solr create_collection -c reviews -d reviews
</pre>
* Index "documents" using the "scraping" code you defined above, pointing your code to my test files.  For example:
<pre>
prod = productJSON("my-test-product-data.txt")
solr = pysolr.Solr('http://localhost:8983/solr/products')
solr.add(prod, commit=True)
</pre>
<pre>
rev = reviewJSON("my-test-review-data.txt")
solr = pysolr.Solr('http://localhost:8983/solr/reviews')
solr.add(rev, commit=True)
</pre>
* Start Flask, pointing it at your project directory.  For example:
<pre>
set FLASK_APP=reviewsite
set FLASK_DEBUG=1
flask run
</pre>
* Then I will run searches and lookups to test the overall functionality of your site on the test data set.

### To Prepare This Notebook to Hand In
Your notebook should contain only two cells, in this order:
1.  A Markdown cell that contains your name, the course number and name, and the quarter, and that identifies the notebook as a solution for Assignment 2
2.  A Code cell that contains your function definitions for *productJSON* and *reviewJSON*