# Stackexchange

### Introduction

In this lesson, we'll use data from the [stackexchange-postgres](https://github.com/Networks-Learning/stackexchange-dump-to-postgres) repository.

### Connecting to the database with postgres

If the data is loaded into a local Postgres database, we can connect to it through SQLAlchemy with the following.

In [1]:
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://@localhost:5432/beerSO')

A sample query that finds two users who created their accounts after the start of 2020 looks like the following.

In [65]:
pd.read_sql("""select * from users where users.creationdate > '2020-01-01' limit 2""", engine)

Unnamed: 0,id,reputation,creationdate,displayname,lastaccessdate,websiteurl,location,aboutme,views,upvotes,downvotes,profileimageurl,age,accountid,jsonfield
0,10125,11,2020-03-02 16:43:11.310,David,2020-03-02 17:37:25.600,,,,0,0,0,,,17887353,
1,11428,1,2020-07-03 00:53:08.707,binarystone,2020-07-03 00:53:08.707,,,,0,0,0,https://i.stack.imgur.com/nmcUe.png,,18972238,


### Connecting with SQLite

> Run the cells below only if the Postgres connection above did not work.

In [3]:
import sqlite3
conn = sqlite3.connect('stackexchange.db')
cursor = conn.cursor()

In [4]:
import pandas as pd
root_url = "https://raw.githubusercontent.com/sql-fundamentals-jigsaw/mod-1-sql-curriculum/master/5-stackexchange/data/"
names = ['users', 'comments', 'posts', 'votes']
loaded_dfs = [pd.read_csv(f'{root_url}{name}.csv') for name in names]

In [6]:
for index, name in enumerate(names):
    loaded_dfs[index].to_sql(f'{name}', conn, index = False, if_exists = 'replace')

In [8]:
pd.read_sql("""select * from users 
where users.creationdate > '2020-01-01' limit 2""", conn)

Unnamed: 0,id,reputation,creationdate,displayname,lastaccessdate,websiteurl,location,aboutme,views,upvotes,downvotes,profileimageurl,age,accountid,jsonfield
0,10125,11,2020-03-02 16:43:11.31,David,2020-03-02 17:37:25.6,,,,0,0,0,,,17887353,
1,11428,1,2020-07-03 00:53:08.707,binarystone,2020-07-03 00:53:08.707,,,,0,0,0,https://i.stack.imgur.com/nmcUe.png,,18972238,
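
To check that a load like the one above worked, it can help to list the tables SQLite knows about. The sketch below uses a throwaway in-memory database with a made-up `users` table (the rows and the `demo_conn` name are assumptions, not the lesson's data). It also shows why the date filter works in SQLite: `creationdate` is stored as text, and ISO-formatted timestamps sort correctly as strings.

```python
import sqlite3

# Throwaway in-memory database with made-up rows (not the lesson's data).
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("create table users (id integer, creationdate text)")
demo_conn.executemany(
    "insert into users values (?, ?)",
    [
        (1, "2019-06-30 10:00:00.000"),
        (2, "2020-03-02 16:43:11.310"),
        (3, "2020-07-03 00:53:08.707"),
    ],
)

# sqlite_master holds one row per table (and index) in the database.
table_names = [
    row[0]
    for row in demo_conn.execute(
        "select name from sqlite_master where type = 'table'"
    )
]

# creationdate is text, so this is a plain string comparison.
recent_ids = [
    row[0]
    for row in demo_conn.execute(
        "select id from users where creationdate > '2020-01-01' order by id"
    )
]

print(table_names)
print(recent_ids)
```

The same `sqlite_master` query can be passed to `pd.read_sql` with the lesson's `conn` to confirm that all four tables were created.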


### Exploring our tables

We have a couple of key tables.
* The comments table, which has a `postid` and a `userid`.
* The posts table, which has the `owneruserid` of the user who made the post, along with information such as `score` and `viewcount`.

Also pay attention to the users table, which we queried above.
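
Before writing queries, it is often useful to look at each table's columns. The sketch below uses SQLite's `pragma table_info` on two toy tables. The column lists are assumptions based on the description above (`id` and the comments `score` column are guesses), and `column_names` is a hypothetical helper, not part of the lesson.

```python
import sqlite3

# Toy tables whose columns are assumed from the lesson's description.
schema_conn = sqlite3.connect(":memory:")
schema_conn.execute(
    "create table comments (id integer, postid integer, userid integer, score integer)"
)
schema_conn.execute(
    "create table posts (id integer, owneruserid integer, score integer, viewcount integer)"
)


def column_names(connection, table):
    """Return the column names of a table (hypothetical helper)."""
    # pragma table_info yields one row per column; index 1 is the name.
    return [row[1] for row in connection.execute(f"pragma table_info({table})")]


comment_columns = column_names(schema_conn, "comments")
post_columns = column_names(schema_conn, "posts")
print(comment_columns)
print(post_columns)
```

With the SQLite connection from this lesson, `pd.read_sql("pragma table_info(posts)", conn)` should return the same information as a DataFrame.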

### Writing our queries

1. Begin by finding the number of users in the database that have a reputation over 100.

In [11]:
query = """ """
# pd.read_sql(query, conn)

# 3566

2. Next, find the five users with the highest average comment score, including only users who have made more than 10 comments. Display each user's `displayname` in the result.

In [37]:
pd.read_sql("""
""", conn)
# displayname	avg_score
# 0	user23614	2.727273
# 1	wax eagle	1.315789
# 2	user505255	1.307692
# 3	Lucas Kauffman	1.117647
# 4	Fishtoaster	1.047619

Unnamed: 0,displayname,avg_score
0,user23614,2.727273
1,wax eagle,1.315789
2,user505255,1.307692
3,Lucas Kauffman,1.117647
4,Fishtoaster,1.047619
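
Questions of this shape usually combine a join, `group by`, and `having`. The sketch below shows the pattern on a toy database, not on the lesson's data: the rows are made up, the comments `score` column is an assumption, and the thresholds (more than 1 comment, top 2) are deliberately different from the exercise.

```python
import sqlite3

# Toy data: made-up users and comment scores.
agg_conn = sqlite3.connect(":memory:")
agg_conn.execute("create table users (id integer, displayname text)")
agg_conn.execute("create table comments (userid integer, score integer)")
agg_conn.executemany(
    "insert into users values (?, ?)", [(1, "ana"), (2, "ben"), (3, "cy")]
)
agg_conn.executemany(
    "insert into comments values (?, ?)",
    [(1, 2), (1, 4), (1, 3), (2, 5), (3, 1), (3, 1)],
)

# where filters rows before grouping; having filters the groups afterwards.
top_commenters = agg_conn.execute(
    """
    select users.displayname, avg(comments.score) as avg_score
    from users
    join comments on comments.userid = users.id
    group by users.id, users.displayname
    having count(*) > 1
    order by avg_score desc
    limit 2
    """
).fetchall()

print(top_commenters)
```

Here `ben` has the highest average but only one comment, so the `having` clause removes him before the ordering and limit are applied.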


3. Next, look at the posts table. Find the `owneruserid`s of the five users with the highest average post score, and include the average score. Only consider posts whose owner created their account after `'2019-01-01'`, and only include owners who have more than five posts.

In [54]:
pd.read_sql("""
""", conn)

# 	owneruserid	avg_score
# 0	8506	4.333333
# 1	8518	3.068966
# 2	14432	2.875000
# 3	11663	2.714286
# 4	8672	2.200000

Unnamed: 0,owneruserid,avg_score
0,8506,4.333333
1,8518,3.068966
2,14432,2.875
3,11663,2.714286
4,8672,2.2


4. Next find the number of users who have not made a comment.

In [57]:
pd.read_sql("""
""", conn)

# 	count
# 0	8575

Unnamed: 0,count
0,8575
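
One common way to count rows that have no match in another table is a `left join` followed by an `is null` filter. The sketch below illustrates this on made-up rows. It is one possible approach, not necessarily the one the lesson intends.

```python
import sqlite3

# Toy data: three users, and only user 1 has commented (twice).
anti_conn = sqlite3.connect(":memory:")
anti_conn.execute("create table users (id integer)")
anti_conn.execute("create table comments (userid integer)")
anti_conn.executemany("insert into users values (?)", [(1,), (2,), (3,)])
anti_conn.executemany("insert into comments values (?)", [(1,), (1,)])

# A left join keeps every user; users without a comment get null
# in the comments columns, which the where clause then selects.
users_without_comments = anti_conn.execute(
    """
    select count(*)
    from users
    left join comments on comments.userid = users.id
    where comments.userid is null
    """
).fetchone()[0]

print(users_without_comments)
```

Because unmatched users appear exactly once in the joined result, `count(*)` counts each of them once, even though user 1 appears twice.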


5. Finally, find the proportion of users who have not made a comment, expressed as a decimal fraction (for example, 0.5 for half of all users).

In [61]:
pd.read_sql("""
""", conn)

# users_without_comment
# 0	0.687099


Unnamed: 0,users_without_comment
0,0.687099
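
When computing a ratio of two counts, note that SQLite (like Postgres) performs integer division when both operands are integers, so the result is truncated. The sketch below, on made-up rows, shows the pitfall and one way around it: multiplying by `1.0` first. The `not exists` subquery is just one way to find users without comments.

```python
import sqlite3

# Toy data: four users, and only user 1 has commented.
ratio_conn = sqlite3.connect(":memory:")
ratio_conn.execute("create table users (id integer)")
ratio_conn.execute("create table comments (userid integer)")
ratio_conn.executemany("insert into users values (?)", [(1,), (2,), (3,), (4,)])
ratio_conn.execute("insert into comments values (1)")

# Subquery that counts users with no matching comment.
without_comments = """
    (select count(*) from users
     where not exists (
         select 1 from comments where comments.userid = users.id
     ))
"""

# Integer divided by integer truncates toward zero.
truncated = ratio_conn.execute(
    f"select {without_comments} / (select count(*) from users)"
).fetchone()[0]

# Multiplying by 1.0 first forces floating-point division.
fraction = ratio_conn.execute(
    f"select 1.0 * {without_comments} / (select count(*) from users)"
).fetchone()[0]

print(truncated, fraction)
```

Three of the four toy users have no comments, so the truncated result is 0 while the floating-point result is 0.75.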
