## Big Data - a Beginners' Lesson
### Segment 1 of 3
### WTH is Big Data?
#### In this segment, you will: 
*	Learn about and explore big data
*	Process and visualize big data

<i>Lesson Developers: </i>
<ul>
    <li>
    <i>Edwin Chow, chow@txstate.edu</i>
    </li>
    <li>
    <i>Jayakrishnan Ajayakumar, jxa421@case.edu</i>
    </li>
</ul>


In [None]:
# This code cell starts the necessary setup for Hour of CI lesson notebooks.
# First, it enables users to hide and unhide code by producing a 'Toggle raw code' button below.
# Second, it imports the hourofci package, which is necessary for lessons and interactive Jupyter Widgets.
# Third, it helps hide/control other aspects of Jupyter Notebooks to improve the user experience
# This is an initialization cell
# It is not displayed because the Slide Type is 'Skip'

from IPython.display import HTML, IFrame, Javascript, display
from ipywidgets import interactive
import ipywidgets as widgets
from ipywidgets import Layout

import getpass # This library allows us to get the username (User agent string)

# import package for hourofci project
import sys
sys.path.append('../../supplementary') # relative path (may change depending on the location of the lesson notebook)
import hourofci

import warnings
warnings.filterwarnings('ignore') # Hide warnings

# load javascript to initialize/hide cells, get user agent string, and hide output indicator
# hide code by introducing a toggle button "Toggle raw code"
# HTML(''' 
#     <script type="text/javascript" src=\"../../supplementary/js/custom.js\"></script>
    
#     <input id="toggle_code" type="button" value="Toggle raw code">
# ''')

HTML(''' 
    <script type="text/javascript" src=\"../../supplementary/js/custom.js\"></script>
    
    <style>
        .output_prompt{opacity:0;}
    </style>
    
    <input id="toggle_code" type="button" value="Toggle raw code">
''')


## Thank you for helping our study


<a href="#/slide-1-0" class="navigate-right" style="background-color:blue;color:white;padding:8px;margin:2px;font-weight:bold;">Continue with the lesson</a>

Throughout this lesson you will see reminders, like the one below, to ensure that all participants understand that they are in a voluntary research study.

### Reminder

<font size="+1">

By continuing with this lesson you are granting your permission to take part in this research study for the Hour of Cyberinfrastructure: Developing Cyber Literacy for GIScience project. In this study, you will be learning about cyberinfrastructure and related concepts using a web-based platform that will take approximately one hour per lesson. Participation in this study is voluntary.

Participants in this research must be 18 years or older. If you are under the age of 18 then please exit this webpage or navigate to another website such as the Hour of Code at https://hourofcode.com, which is designed for K-12 students.

If you are not interested in participating please exit the browser or navigate to this website: http://www.umn.edu. Your participation is voluntary and you are free to stop the lesson at any time.

For the full description please navigate to this website: <a href="../../gateway-lesson/gateway/gateway-1.ipynb">Gateway Lesson Research Study Permission</a>.

</font>

## WTH is Big Data?
The image below is a 'heatmap' of something.

In [None]:
import time
from IPython.display import clear_output
widget1 = widgets.RadioButtons(
    options = ['Temperature', 'Road', 'Night light', 'Social media post'],
    description = '<p style="display:inline;font-size:20px"> Guess what <p style="display:inline; color:#FFA500;font-size:20px">orange</p>/<p style="display:inline;color:#1E90FF;font-size:20px">blue</p>/<p style="display:inline; color:#F0FFFF;font-size:20px; background-color:#000000;">white</p> <p style="display:inline; font-size:20px;">represent...</p>', style={'description_width': 'initial'},
    layout = Layout(width='100%'),
    value = None
)

display(widget1)

# hourofci.SubmitBtn2(widget1)

def SubmitBtn(widget):
    button = widgets.Button(
        description = 'Submit',
        layout=Layout(width='auto', height='auto'),
        disabled = False,
        button_style = '',
        icon = 'check'
    )
    
    display(button)
    output = widgets.Output()
    display(output) 
    
    def submit(b):
        clear_output()
        display(widget)
        display(button)
        display(output)
        print("Great! Move to the next slide to see the answer.")

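        def countdown(t):
            # Tick down once per second. `out` is the countdown display
            # created from the Output class in a later cell of this notebook.
            while t:
                out.update(t)
                time.sleep(1)
                t -= 1

        # Run the 20-second countdown, then replace it with the link below.
        countdown(20)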
        out.update(HTML(''' <br/>
            <a id='button' href="#/slide-4-0" class="navigate-right" style="background-color:Green;color:white;padding:8px;margin:2px;font-weight:bold;">Nice try! Continue to see the answer!</a>
        '''))
    
    button.on_click(submit)
    
SubmitBtn(widget1)



<table>
    <tr style="background: #fff; text-align: left; vertical-align: top;">
        <td style="width: 100%; background: #fff; text-align: center; vertical-align: top;"><center> <img src='supplementary/heatmap.jpg' width="700" height="900" alt='map'/></center></td>
    </tr>
</table>


## WTH is Big Data?

<p style="display:inline;">This is a heatmap of geotagged social media posts, where <p style="display:inline; color:#FFA500;">orange = Flickr</p>, <p style="display:inline;color:#1E90FF;">blue = tweets</p>, and <p style="display:inline; color:#F0FFFF; background-color:#000000;">white = both</p>. 
<br/>Do you see any spatial pattern(s)? </p>

<table>
    <tr style="background: #fff; text-align: left; vertical-align: top;">
        <td style="width: 100%; background: #fff; text-align: left; vertical-align: top;"> <center><img src='supplementary/heatmap.jpg' width="700" height="900" alt='map'></center></td>
    </tr>
</table>
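
A heatmap like this one is built by counting how many points fall in each small area of the map. The sketch below shows the idea with a handful of made-up coordinates; the `posts` list, the `bin_points` helper, and the one-degree cell size are all assumptions for illustration and are not how the image above was produced.

```python
from collections import Counter

# Hypothetical (longitude, latitude) pairs for geotagged posts; invented for illustration.
posts = [
    (-97.74, 30.27), (-97.75, 30.26), (-97.73, 30.28),  # three posts close together
    (-95.37, 29.76), (-95.36, 29.75),                   # two posts close together
    (-101.85, 33.58),                                   # one isolated post
]

def bin_points(points, cell_size=1.0):
    """Count how many points fall in each grid cell of `cell_size` degrees."""
    counts = Counter()
    for lon, lat in points:
        # Floor division maps each coordinate to the index of its grid cell.
        cell = (int(lon // cell_size), int(lat // cell_size))
        counts[cell] += 1
    return counts

density = bin_points(posts)

# Cells with more posts would be drawn "hotter" on the map.
for cell, n in density.most_common():
    print(cell, n)
```

With millions of real posts, the same counting step is what makes cities and roads stand out.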



In [None]:
class Output:
    def __init__(self, name='countdown'):
        self.h = display(display_id=name)
        self.content = ''
        self.mime_type = None
        self.dic_kind = {
            'text': 'text/plain',
            'markdown': 'text/markdown',
            'html': 'text/html',
        }
        
    def display(self):
        self.h.display({'text/plain': ''}, raw=True)
        
    def _build_obj(self, content, kind, append, new_line):
        self.mime_type = self.dic_kind.get(kind)
        if not self.mime_type:
            return content, False
        if append:
            sep = '\n' if new_line else ''
            self.content = self.content + sep + content
        else:
            self.content = content
        return {self.mime_type: self.content}, True
        
    def update(self, content, kind=None, append=False, new_line=True):
        obj, raw = self._build_obj(content, kind, append, new_line)
        self.h.update(obj, raw=raw)
    
print('\033[1m','Think about it for 20 seconds!')
out = Output(name='countdown')
out.display()

Tell us what you thought in the text area below.


In [None]:

w = widgets.Textarea(
            value='',
            placeholder='Type your answer here',
            description='',
            disabled=False,
            layout=Layout( height='100px', min_height='100px', width='900px')
            )


def out1():
    print('Submitted!')
    
display(w)
hourofci.SubmitBtn2(w, out1)

## WTH is Big Data?

<p style="display:inline;">This “heat” map tells us a lot about <p style="display:inline; color:#ff0000;">PEOPLE</p>!! </p>


<table>
    <tr style="background: #fff; text-align: left; vertical-align: top;">
        <td style="width: 50%; background: #fff; text-align: left; vertical-align: top;"> <img src='supplementary/heatmap.jpg' width="700" height="900" alt='map'></td>
        <td style="background: #fff; text-align: left; font-size: 24px;">What do these patterns tell us?
            <br/><strong>1. Where people are</strong> <br/>
   &nbsp;&nbsp;&nbsp;&nbsp; → notice how big cities and transportation networks show up
            <br/><strong>2. What people share </strong> <br/>
   &nbsp;&nbsp;&nbsp;&nbsp; → Flickr vs. tweets <br/>
Social media is an example of <p style="display:inline; color:#0096FF;">BIG DATA</p>.
</td>
    </tr>
</table>



## WTH is Big Data?


<table><br/><br/>
    <tr style="background: #fff; text-align: left; vertical-align: top;"><p style="display:inline; color:#0096FF; font-size: 24px;">Definition:</p>
        <td style="background: #fff; text-align: left; font-size: 24px; vertical-align: top;"><i> Datasets characterized by a large volume of complex data produced at an accelerating pace. This definition is often summarized as “The 3 Vs” illustrated in the graphic.</i></td>
        <td style="width: 50%; background: #fff; text-align: left; vertical-align: top;"> <center><p style="display:inline; color:#0096FF; font-size: 24px;">The 3Vs of Big Data</p><img src='supplementary/3v.png' width="700" height="900" alt='The 3Vs of big data'></center></td>
    </tr>
    
</table>



## The ‘V’s of Big Data: Volume
<br/>

<p style="display:inline; color:#0096FF; font-size: 20px;">Volume</p> <p style="display:inline; font-size: 20px;">- The amount of data, which for big data reaches astronomical scales measured in units such as petabytes, exabytes, zettabytes, and yottabytes.</p>


<table><br/><br/>
    <tr style="background: #fff; text-align: left; vertical-align: top;"><td style="width: 50%; background: #fff; text-align: left; vertical-align: top;"> <img src='supplementary/dobrilova.png' width="700" height="900" alt='map'></td>
        <td style="background: #fff; text-align: left; font-size: 18px; ">  
            <ul>
                <li>2022 Figures (Dobrilova 2022)</li> 
            <ul>
                <li>Facebook: 4.2M likes; 211k new photos</li>
                <li>Instagram: 347k browsing; 44k new photos</li>
                <li>Twitter: 87.5k new tweets</li>
                <li>Tumblr: 37k new posts</li>
                <li>YouTube: 4.5M videos watched; 1000 hrs of new videos uploaded</li>
                <li>Netflix: 694k hrs of video watched</li>
                <li>Texting: ~60M texts sent</li></ul>
           <li>How many messages/posts/videos are there every day/month/year?</li>
                </ul>
 </td>
    </tr>
</table>
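
To get a feel for the units named above, here is a minimal sketch that expresses each one as a count of ordinary laptop drives. The unit sizes are the standard decimal (SI) definitions; the 1 TB drive and the `how_many` helper are assumptions chosen for illustration.

```python
# Decimal (SI) byte units named on this slide, as powers of ten.
UNITS = {
    "petabyte": 10**15,
    "exabyte": 10**18,
    "zettabyte": 10**21,
    "yottabyte": 10**24,
}

# Assumption for illustration: a laptop drive that holds 1 terabyte (10**12 bytes).
LAPTOP_DRIVE = 10**12

def how_many(item_bytes, unit):
    """Return how many items of size `item_bytes` fit in one `unit`."""
    return UNITS[unit] // item_bytes

for name in UNITS:
    print(f"1 {name} = {how_many(LAPTOP_DRIVE, name):,} laptop drives of 1 TB")
```

Each step up the list multiplies the previous unit by one thousand.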
<br>


## The ‘V’s of Big Data: Velocity
<br/>
<p style="display:inline; color:#0096FF; font-size: 20px;">Velocity</p> <p style="display:inline; font-size: 20px;">- The rate at which big data are generated over time. Watch the following video:</p>



In [None]:
from IPython.display import YouTubeVideo
# print('Watch: OpenStreetMap for Haiti 12th Jan 2010')
# YouTubeVideo('e89Tqr75mMw', width=800,height=480)
YouTubeVideo('BxS3zV3_STQ', width=800,height=480)
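
Velocity is a rate, so it can be worked out from a count and the time it took to accumulate. The sketch below uses the "87.5k new tweets" figure from the Volume slide and assumes that it covers one minute; the slide does not state the interval, so treat the results as an illustration of the arithmetic rather than as real statistics. The `scale` helper is a name made up for this example.

```python
# Figure from the Volume slide: 87.5k new tweets.
# ASSUMPTION: the count covers one minute; the slide does not state the interval.
tweets_per_interval = 87_500
interval_seconds = 60

def scale(count, seconds, target_seconds):
    """Scale a count observed over `seconds` to a period of `target_seconds`."""
    return count * target_seconds / seconds

SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY

per_second = scale(tweets_per_interval, interval_seconds, 1)
per_day = scale(tweets_per_interval, interval_seconds, SECONDS_PER_DAY)
per_year = scale(tweets_per_interval, interval_seconds, SECONDS_PER_YEAR)

print(f"per second: {per_second:,.0f}")
print(f"per day:    {per_day:,.0f}")
print(f"per year:   {per_year:,.0f}")
```

The same function answers the "every day/month/year" question for any of the other figures, under the same assumption about the interval.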


## The ‘V’s of Big Data: Variety
<br/>

<p style="display:inline; color:#0096FF; font-size: 20px;">Variety</p> <p style="display:inline; font-size: 20px;">- The degree of heterogeneity in how big data are encoded, structured, formatted and represented.</p>


<table>
    <tr style="background: #fff; text-align: left; vertical-align:">
        <td style="width: 60%; background: #fff; text-align: left; vertical-align: top;"> <img src='supplementary/Variety.png' alt='map'></td>
        <td style="background: #fff; text-align: left; font-size: 21px;">
            Big data can be any combination of various digital data, such as
            <ul>
                <li>text</li>
                <li>image</li>
                <li>video</li>
                <li>audio</li>
                <li>location</li>
                <li>measurement</li>
                <li>date &amp; time</li>
                <li>...</li>
            </ul>
</td>
    </tr>
</table>
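
Variety also shows up when the same information is stored in different ways. The sketch below writes one made-up social media post as JSON, as a CSV row, and as free text, using only Python's standard library. The `post` record and all of its values are invented for illustration.

```python
import csv
import io
import json

# A hypothetical geotagged post; every value is invented for illustration.
post = {
    "text": "Sunset over the river",
    "latitude": 29.89,
    "longitude": -97.94,
    "timestamp": "2022-06-01T20:15:00",
}

# Semi-structured: JSON keeps each field name next to its value.
as_json = json.dumps(post)

# Structured: a CSV row relies on a separate header line for its meaning.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(post))
writer.writeheader()
writer.writerow(post)
as_csv = buffer.getvalue()

# Unstructured: free text that a program would have to parse to recover the fields.
as_text = (
    f"{post['text']} (posted at {post['latitude']}, {post['longitude']} "
    f"on {post['timestamp']})"
)

print(as_json)
print(as_csv)
print(as_text)
```

A big data workflow often has to combine all three kinds of encoding, which is part of what makes variety a challenge.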

## The [Other] ‘V’s of Big Data: Value
<br/>
<p style="font-size: 20px;">Besides “The 3 Vs”, there are other Vs that people talk about.</p>
<p style="display:inline; color:#0096FF; font-size: 20px;">Value</p> <p style="display:inline; font-size: 20px;">- The usefulness of big data in providing unique insights for problem solving and/or decision making.</p>
    

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('rwOIQzcXx7Y', width=800,height=480)

Explore the tools here: https://coronavirus.jhu.edu/covid-19-daily-video 
<ul>
                <li>Where are COVID-19 cases rising?</li>
                <li>What are the trends of COVID-19 cases and testing?</li>
                <li>Which countries have flattened the curve?</li>
</ul>


## The [Other] ‘V’s of Big Data: Veracity
<br/>

<p style="display:inline; color:#0096FF; font-size: 20px;">Veracity</p> <p style="display:inline; font-size: 20px;">- The quality of big data and its implications for subsequent applications.</p>


<table>
  <tr style="background-color:transparent">
    <td style="width: 50%">
        <img src='supplementary/Veracity.png' width="600"/>
    </td>
    <td style="padding-right:10px; width:700px">
        <ul style="text-align: left; font-size: 20px; ">
            <strong>Examine the emojis:</strong>
            <li>Do you agree/disagree? Why? </li> 
            <ul>
                <li>Grapes in TX</li>
                <li>Snowman in D.C.</li>
            </ul>
         </ul>
        <ul style="text-align: left; font-size: 20px; ">
               <strong>To understand the biases, think about the following questions:</strong>
               <li>Who produced the data?</li>
               <li>When was the survey conducted?</li>
        </ul>
        <p style="text-align: left; font-size: 20px; ">
            Write your reflection in the text area on the next slide.</p>
    </td>
  </tr>
</table>





In [None]:

w2 = widgets.Textarea(
            value='',
            placeholder='Type your answer here',
            description='',
            disabled=False,
            layout=Layout( height='100px', min_height='100px', width='900px')
            )


def out2():
    print('Submitted!')
    
display(w2)
hourofci.SubmitBtn2(w2, out2)

## The [Other] ‘V’s of Big Data: Veracity



<br/>

<p style="display:inline; color:#0096FF; font-size: 20px;">Veracity</p> <p style="display:inline; font-size: 20px;">- We should be aware of any biases (e.g. sampling) and quality issues.</p>
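
To see how sampling bias can distort a result, consider this minimal sketch. It builds a made-up population, then compares the average age from a sample that only includes people under 40 with the average from a sample that takes every third person. The population, the age cutoff, and the sample designs are all assumptions for illustration, not facts about any real platform.

```python
from statistics import mean

# Hypothetical population: one person of each age from 18 to 79.
population_ages = list(range(18, 80))

# Biased sample: only people under 40 are included
# (an assumed stand-in for a data source that misses older people).
biased_sample = [age for age in population_ages if age < 40]

# Systematic sample: every third person, regardless of age.
systematic_sample = population_ages[::3]

print("Population mean age:       ", mean(population_ages))
print("Biased sample mean age:    ", mean(biased_sample))
print("Systematic sample mean age:", mean(systematic_sample))
```

The biased sample is larger than the systematic one, yet its average is much further from the population average. More data does not fix a biased source.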

<!-- <table>
    <tr style="background: #fff; text-align: left; vertical-align: top;"><td style="width: 20%; background: #fff; text-align: left; vertical-align: top;"> <img src='supplementary/Veracity.png' ></td>
    <td style="width: 20%; background: #fff; text-align: left; vertical-align: top;"> <img src='supplementary/veracity2.png' ></td>

</table> -->

## The [Other] ‘V’s of Big Data: Visualization

<br/>
<p style="display:inline; color:#0096FF; font-size: 20px;">Visualization</p> <p style="display:inline; font-size: 20px;">- A data rendering process to highlight the spatial, temporal and/or thematic patterns of big data through charts, graphics and creative illustrations.</p>
<br/>
<br/>
<p style="display:inline; font-size: 15px;">Explore the JHU <a href="https://coronavirus.jhu.edu/us-map">COVID-19 Dashboard</a> below and answer the following questions:
  <br/>  
<ul style="font-size: 15px;">
        <li>Where are the hotspots/coldspots?</li>
        <li>Using the left panel, which county has the most confirmed cases?</li>
        <li>Click on that county (or any county) in the map</li>
        <li>In the popup window, scroll down to see the infographics</li>
        <li>Click it to open a new tab and examine the infographics</li>
        <li>Which visualization tool(s) help you understand the data best?</li>
</ul></p>
<!-- <center><img src='supplementary/dashboard.png' alt='dashboard' width="1000" height="800"> -->


In [None]:
%%html
<iframe src = "https://www.arcgis.com/apps/dashboards/409af567637846e3b5d4182fcd779bea" width="100%" height="500"></iframe>
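
Even a very simple chart can make numbers easier to compare than a table. The sketch below draws a text-only bar chart with plain Python. The county names and case counts are invented for illustration and are not taken from the dashboard; `bar_chart` is a helper written for this example.

```python
# Hypothetical confirmed-case counts per county; names and numbers are invented.
cases = {"County A": 1200, "County B": 450, "County C": 3000, "County D": 75}

def bar_chart(data, width=40):
    """Return one text line per item, with bars scaled so the largest is `width` characters."""
    largest = max(data.values())
    lines = []
    # Sort from the largest value to the smallest so the ranking is obvious.
    for name, value in sorted(data.items(), key=lambda item: item[1], reverse=True):
        bar = "#" * round(value / largest * width)
        lines.append(f"{name:<10} {bar} {value}")
    return lines

print("\n".join(bar_chart(cases)))
```

The dashboard applies the same idea at a much larger scale, adding maps and time series on top of bar charts.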


## Show Me the Data!
<body style="display:inline; font-size: 15px;">In the JHU COVID-19 Dashboard:
<br/>
<ul>
        <li>Are there any missing data? Why?</li>
        <li>Scroll down to the bottom panel and click the link “Downloadable Database: Github”</li>
        <li>Examine the data sources</li>
        <li>At the top, click on the “csse_covid_19_data” folder</li>
        <li>Click into the “csse_covid_19_daily_reports” folder</li>
        <li>Find the .csv with today’s date and click into it</li>
        <li>Examine the data</li>
</ul></body>

<center><img src='supplementary/gitshot.png' alt='git' width="800" height="800"></center>

<br>    

    


<body style="display:inline; font-size: 15px;">Look at the data and examine the following:
<br/>
<ul>
        <li>Are cases up or down?</li>
        <li>Are some countries doing worse/better?</li>
        <li>What types of data are in the spreadsheet?</li>
        <li>Are they easier to understand with the visualization?</li>
</ul>
</body>
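
A CSV file like the daily report can also be read by a program instead of by eye. The sketch below parses a tiny made-up CSV with Python's standard `csv` module. The column names and all values are assumptions for illustration; the real file has more columns, so check its header row before adapting this.

```python
import csv
import io

# Invented sample data; the column names are assumptions, not the real file's header.
sample_csv = """Country_Region,Confirmed,Deaths
Country A,1500,30
Country B,800,8
Country C,2300,69
"""

# DictReader turns each row into a dictionary keyed by the header names.
rows = list(csv.DictReader(io.StringIO(sample_csv)))

# CSV values arrive as text, so convert the numeric columns before doing math.
for row in rows:
    row["Confirmed"] = int(row["Confirmed"])
    row["Deaths"] = int(row["Deaths"])

total_confirmed = sum(row["Confirmed"] for row in rows)
worst = max(rows, key=lambda row: row["Confirmed"])

print("Total confirmed cases:", total_confirmed)
print("Most confirmed cases: ", worst["Country_Region"])
```

To read a downloaded file instead, the `io.StringIO(sample_csv)` part would be replaced by an opened file object.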
 
Write down your answers in the following text area.


In [None]:

w4 = widgets.Textarea(
            value='',
            placeholder='Type your answer here',
            description='',
            disabled=False,
            layout=Layout( height='100px', min_height='100px', width='900px')
            )


def out4():
    print('Submitted!')
    
display(w4)
hourofci.SubmitBtn2(w4, out4)

## Summary
That's it!! You have learned:

<ul>
    <li>What big data is (with some examples, e.g. social media posts)</li>
    <li>The 3Vs or 6Vs of big data</li>
    <li>How to explore and examine big data about public health</li>
</ul>

Feel free to:
<ul>
    <li>Explore other types of big data</li>
    <li>Go to the next notebook to learn about big data processing</li>
</ul>
<br/>



<font size="+1"><a style="background-color:blue;color:white;padding:12px;margin:10px;font-weight:bold;" href="bigdata-3.ipynb">Click here to go to the next notebook.</a></font>
