# Stackexchange

### Introduction

In this lesson, we'll use data from the [stackexchange-postgres](https://github.com/Networks-Learning/stackexchange-dump-to-postgres) repository.

You can load the data into postgres by running the following commands in the terminal.

> If the steps below do not work, you can instead load the data into SQLite, which we cover later in this lesson.

```bash 
git clone git@github.com:Networks-Learning/stackexchange-dump-to-postgres.git
cd stackexchange-dump-to-postgres/
```



Then install the Python libraries listed in `requirements.txt`.
```bash
pip install -r requirements.txt
```

Then create a postgres database called `beerSO` from the command line.

```bash
createdb beerSO
```

And then from the `stackexchange-dump-to-postgres` folder (which you should be in) run the following:

```bash
python load_into_pg.py -s beer -d beerSO
```

Now if you connect to the `beerSO` database, you should be able to see the tables listed there.

```bash
psql beerSO
```

To display the tables, run the following at the `psql` prompt.

```
\dt
```

Don't worry if you see a long list: the main tables we'll be working with are the `users`, `comments`, and `posts` tables.

### Connecting to the database with postgres

With postgres, we can connect to the database with the following.

In [1]:
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://@localhost:5432/beerSO')

A sample query, finding a couple of users who created their accounts after January 1, 2020, looks like the following.

In [65]:
pd.read_sql("""select * from users where users.creationdate > '2020-01-01' limit 2""", engine)

| | id | reputation | creationdate | displayname | lastaccessdate | websiteurl | location | aboutme | views | upvotes | downvotes | profileimageurl | age | accountid | jsonfield |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10125 | 11 | 2020-03-02 16:43:11.310 | David | 2020-03-02 17:37:25.600 | | | | 0 | 0 | 0 | | | 17887353 | |
| 1 | 11428 | 1 | 2020-07-03 00:53:08.707 | binarystone | 2020-07-03 00:53:08.707 | | | | 0 | 0 | 0 | https://i.stack.imgur.com/nmcUe.png | | 18972238 | |


### Connecting with SQLite

> Run the cells below only if the postgres setup above did not work.

In [68]:
import sqlite3
conn = sqlite3.connect('stackexchange.db')
cursor = conn.cursor()

In [67]:
import pandas as pd

# Expects users.csv, comments.csv, posts.csv and votes.csv in a local ./data folder.
names = ['users', 'comments', 'posts', 'votes']
loaded_dfs = [pd.read_csv(f'./data/{name}.csv') for name in names]

In [69]:
# Write each DataFrame to a SQLite table of the same name.
# if_exists='replace' lets this cell be re-run without a "table already exists" error.
for name, df in zip(names, loaded_dfs):
    df.to_sql(name, conn, index=False, if_exists='replace')

In [71]:
# Quick check that the load worked.
pd.read_sql('select * from users limit 2', conn)
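
To check which tables ended up in a SQLite database, you can query the built-in `sqlite_master` table, which plays a role similar to `\dt` in `psql`. The sketch below is only an illustration: it uses an in-memory database and made-up one-column tables (both assumptions, so that it runs without the CSV files). With the real data, the same `select` could be run against `conn`.

```python
import sqlite3

# In-memory database with toy tables; stand-ins for the CSV-loaded tables.
demo_conn = sqlite3.connect(':memory:')
for table_name in ['users', 'comments', 'posts', 'votes']:
    demo_conn.execute(f'create table {table_name} (id integer)')

# sqlite_master holds one row per table, index, view, or trigger.
rows = demo_conn.execute(
    "select name from sqlite_master where type = 'table' order by name"
).fetchall()
table_names = [row[0] for row in rows]
print(table_names)  # ['comments', 'posts', 'users', 'votes']
```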

### Exploring our tables

We have a couple of key tables.
* The `comments` table, which has a `postid` and a `userid`.
* The `posts` table, which has an `owneruserid` identifying the user who made the post, and information like `score` and `viewcount`.

The other table to pay attention to is the `users` table, whose `id` column is what `userid` and `owneruserid` refer to.

### Writing our queries

1. Begin by finding the number of users in the database that have a reputation over 100.

In [62]:
pd.read_sql('select count(*) from users where reputation > 100', engine)

| | count |
|---|---|
| 0 | 3566 |


2. Next find the five users with the highest average comment scores, including only those users who have made more than 10 comments.  Display each user's `displayname` in the result.

In [37]:
pd.read_sql("""select displayname, avg(score) avg_score from comments join users on users.id = userid
group by userid, displayname having count(*) > 10 
order by avg_score desc limit 5""", engine)

| | displayname | avg_score |
|---|---|---|
| 0 | user23614 | 2.727273 |
| 1 | wax eagle | 1.315789 |
| 2 | user505255 | 1.307692 |
| 3 | Lucas Kauffman | 1.117647 |
| 4 | Fishtoaster | 1.047619 |
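
To see how `group by` and `having` work together in a query of this shape, here is a minimal sketch on made-up data in an in-memory SQLite database. The rows, the names, and the lower threshold (`count(*) > 1` rather than `> 10`) are assumptions chosen to keep the example small. They are not taken from the Stack Exchange dump.

```python
import sqlite3

# Toy data: ann and bob each have two comments, cy has only one.
demo_conn = sqlite3.connect(':memory:')
demo_conn.executescript("""
create table users (id integer, displayname text);
create table comments (id integer, userid integer, score integer);
insert into users values (1, 'ann'), (2, 'bob'), (3, 'cy');
insert into comments values (1, 1, 4), (2, 1, 2), (3, 2, 1), (4, 2, 0), (5, 3, 9);
""")

# group by builds one group per user; having then drops groups
# with too few comments (here cy, despite the high score).
rows = demo_conn.execute("""
select displayname, avg(score) as avg_score
from comments join users on users.id = comments.userid
group by userid, displayname
having count(*) > 1
order by avg_score desc
""").fetchall()
print(rows)  # [('ann', 3.0), ('bob', 0.5)]
```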


In [41]:
pd.read_sql("""select * from users order by creationdate desc limit 2 """, engine)

| | id | reputation | creationdate | displayname | lastaccessdate | websiteurl | location | aboutme | views | upvotes | downvotes | profileimageurl | age | accountid | jsonfield |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14889 | 1 | 2022-12-03 12:12:38.130 | Meridianbet .BE | 2022-12-03 12:12:38.130 | https://meridianbet.be/ | | `<p>Address:</p>\n<p>1000 Brussel, Drukpersstra...` | 0 | 0 | 0 | | | 27132199 | |
| 1 | 14888 | 1 | 2022-12-03 11:23:08.130 | Totally Committed | 2022-12-03 11:23:08.130 | | | `<p>Address\nMinneapolis MN 55408\nPhone\n(612)...` | 0 | 0 | 0 | | | 27131923 | |


3. Next look at the posts table.  Find the `owneruserid`s of the five users with the highest average post scores, and include each average score.  Only consider users who created their account after `'2019-01-01'` and who have more than five posts.

In [54]:
pd.read_sql("""select owneruserid, avg(posts.score) avg_score
from posts join users on users.id = posts.owneruserid
where users.creationdate > '2019-01-01'
group by owneruserid having count(*) > 5
order by avg_score desc limit 5""", engine)

| | owneruserid | avg_score |
|---|---|---|
| 0 | 8506 | 4.333333 |
| 1 | 8518 | 3.068966 |
| 2 | 14432 | 2.875 |
| 3 | 11663 | 2.714286 |
| 4 | 8672 | 2.2 |
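
The query above filters twice: `where` removes rows before grouping, and `having` removes whole groups afterwards. A small sketch on made-up data may make the order clearer. The in-memory SQLite database, the rows, the text-typed `creationdate` column, and the `count(*) > 1` threshold are all assumptions for illustration.

```python
import sqlite3

# Toy data in an in-memory database; not the real Stack Exchange tables.
# Dates are ISO-formatted text, so string comparison matches date order.
demo_conn = sqlite3.connect(':memory:')
demo_conn.executescript("""
create table users (id integer, creationdate text);
create table posts (id integer, owneruserid integer, score integer);
insert into users values (1, '2018-05-01'), (2, '2019-06-01'), (3, '2020-02-01');
insert into posts values (1, 1, 10), (2, 1, 10), (3, 2, 3), (4, 2, 5), (5, 3, 8);
""")

# where runs first and drops user 1 (account created before 2019);
# having runs after grouping and drops user 3 (only one post).
rows = demo_conn.execute("""
select owneruserid, avg(posts.score) as avg_score
from posts join users on users.id = posts.owneruserid
where users.creationdate > '2019-01-01'
group by owneruserid
having count(*) > 1
order by avg_score desc
""").fetchall()
print(rows)  # [(2, 4.0)]
```

User 1 has the highest scores in the toy data but never reaches the grouping step, because the `where` clause has already removed those rows.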


4. Next find the number of users who have not made a comment.

In [57]:
pd.read_sql("""select count(*) from users 
left join comments on users.id = comments.userid
where comments.userid is null""", engine)

| | count |
|---|---|
| 0 | 8575 |
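
The `left join ... where ... is null` pattern keeps only the users that found no match in `comments`. As a minimal sketch on made-up data (an in-memory SQLite database, assumed here purely for illustration), the same count can also be written with `not exists`, and the two forms agree.

```python
import sqlite3

# Toy data: user 1 has two comments, users 2 and 3 have none.
demo_conn = sqlite3.connect(':memory:')
demo_conn.executescript("""
create table users (id integer);
create table comments (id integer, userid integer);
insert into users values (1), (2), (3);
insert into comments values (1, 1), (2, 1);
""")

# Unmatched users get a single joined row with null comment columns.
left_join_count = demo_conn.execute("""
select count(*) from users
left join comments on users.id = comments.userid
where comments.userid is null
""").fetchone()[0]

# Equivalent formulation: keep users for whom no comment row exists.
not_exists_count = demo_conn.execute("""
select count(*) from users
where not exists (select 1 from comments where comments.userid = users.id)
""").fetchone()[0]

print(left_join_count, not_exists_count)  # 2 2
```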


5. Finally, find the percentage of users who have not made a comment.

In [61]:
pd.read_sql("""select 1 - 1.0*count(comments.userid)/count(*) as users_without_comment from users 
left join comments on users.id = comments.userid""", engine)

| | users_without_comment |
|---|---|
| 0 | 0.687099 |
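
One thing to watch in a query of this shape: the left join returns one row per comment, so `count(*)` counts joined rows rather than users, and a user with many comments is counted many times. The sketch below uses made-up data in an in-memory SQLite database (an assumption for illustration, not the Stack Exchange dump). It compares that row-based ratio with a variant using `count(distinct ...)`, which counts each user once.

```python
import sqlite3

# Toy data: 4 users; user 1 made 3 comments, user 2 made 1,
# users 3 and 4 made none, so half of the users have no comment.
demo_conn = sqlite3.connect(':memory:')
demo_conn.executescript("""
create table users (id integer);
create table comments (id integer, userid integer);
insert into users values (1), (2), (3), (4);
insert into comments values (1, 1), (2, 1), (3, 1), (4, 2);
""")

# Row-based: the join yields 6 rows, 4 of them with a comment.
row_based = demo_conn.execute("""
select 1 - 1.0*count(comments.userid)/count(*)
from users left join comments on users.id = comments.userid
""").fetchone()[0]

# User-based: 2 distinct commenters out of 4 distinct users.
user_based = demo_conn.execute("""
select 1 - 1.0*count(distinct comments.userid)/count(distinct users.id)
from users left join comments on users.id = comments.userid
""").fetchone()[0]

print(round(row_based, 4), user_based)  # 0.3333 0.5
```

On the toy data only the `distinct` variant returns the share of users without a comment, 0.5. The two ratios coincide only when no user has more than one comment.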
