<h3>HTTP POST method</h3>

In [1]:
# Execute a normal GET request before sending the form data through a POST, though the GET is not strictly required here.
import requests
url = 'http://www.webscrapingfordatascience.com/postform2/'
# First perform a GET request
r = requests.get(url)
# Followed by a POST request
formdata = {
'name': 'Seppe',
'gender': 'M',
'pizza': 'like',
'haircolor': 'brown',
'comments': ''
}
r = requests.post(url, data=formdata)
print(r.headers)  # response headers
print(r.request.headers)  # request headers
print(r.text)

{'Date': 'Mon, 28 Nov 2022 15:36:33 GMT', 'Server': 'Apache/2.4.41 (Ubuntu)', 'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip', 'Content-Length': '235', 'Keep-Alive': 'timeout=5, max=100', 'Connection': 'Keep-Alive', 'Content-Type': 'text/html; charset=UTF-8'}
{'User-Agent': 'python-requests/2.28.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Content-Length': '56', 'Content-Type': 'application/x-www-form-urlencoded'}
<html>
	<body>


<h2>Thanks for submitting your information</h2>

<p>Here's a dump of the form data that was submitted:</p>

<pre>array(5) {
  ["name"]=>
  string(5) "Seppe"
  ["gender"]=>
  string(1) "M"
  ["pizza"]=>
  string(4) "like"
  ["haircolor"]=>
  string(5) "brown"
  ["comments"]=>
  string(0) ""
}
</pre>


	</body>
</html>
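As a rough illustration of what `data=formdata` does, the sketch below uses only the standard library's `urllib.parse.urlencode` to build the same kind of `application/x-www-form-urlencoded` body. It approximates the encoding step and is not requests' actual implementation; the resulting length matches the `Content-Length` of 56 in the request headers printed above.

```python
from urllib.parse import urlencode

formdata = {
    'name': 'Seppe',
    'gender': 'M',
    'pizza': 'like',
    'haircolor': 'brown',
    'comments': ''
}
# Each key/value pair becomes key=value, joined with '&'
body = urlencode(formdata)
print(body)
# The byte length of the body is what ends up in Content-Length
print(len(body.encode('utf-8')))
```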



<pre>To illustrate this, try navigating to
http://www.webscrapingfordatascience.com/postform3/, fill out the form, and submit it. Now try
the same again but wait a minute or two before pressing “Submit my information.” The
web page will inform you that “You waited too long to submit this information.” Let’s try
submitting this form using requests:</pre>

In [2]:
import requests
url = 'http://www.webscrapingfordatascience.com/postform3/'
# No GET request needed?
formdata = {
'name': 'Seppe',
'gender': 'M',
'pizza': 'like',
'haircolor': 'brown',
'comments': ''
}
r = requests.post(url, data=formdata)
print(r.text)  # Will show: Are you trying to submit information from somewhere else?

<html>
	<body>


Are you trying to submit information from somewhere else?

	</body>
</html>



<b>PROBLEM</b> How does the server know that the request was submitted from somewhere else (in this case, from Python)? The answer lies in a hidden input tag in the form's source: <pre>input type="hidden" name="protection" value="2c17abf5d5b4e326bea802600ff88405"</pre>
The following program shows how to incorporate this field in our form POST data:

In [4]:
import requests
url = 'http://www.webscrapingfordatascience.com/postform3/'
formdata = {
'name': 'Seppe',
'gender': 'M',
'pizza': 'like',
'haircolor': 'brown',
'comments': '',
'protection': '2c17abf5d5b4e326bea802600ff88405'
}
r = requests.post(url, data=formdata)
print(r.text)
# Will show: You waited too long to submit this information. Try again.

<html>
	<body>


You waited too long to submit this information. Try <a href="./">again</a>.

	</body>
</html>



<pre>Assuming you waited a minute before running this piece of code, the web server will
now reply with a message indicating that it doesn’t want to handle this request. Indeed,
we can confirm (using our browser), that the “protection” field appears to change every
time we refresh the page, seemingly randomly. To work our way around this, we have
no other alternative but to first fetch out the form’s HTML source using a GET request,
get the value for the “protection” field, and then use that value in the subsequent POST
request.</pre>

In [None]:
import requests
from bs4 import BeautifulSoup
url = 'http://www.webscrapingfordatascience.com/postform3/'
# First perform a GET request
r = requests.get(url)
# Get out the value for protection
html_soup = BeautifulSoup(r.text, 'html.parser')
p_val = html_soup.find('input', attrs={'name': 'protection'}).get('value')
# Then use it in a POST request
formdata = {
    'name': 'Seppe',
    'gender': 'M',
    'pizza': 'like',
    'haircolor': 'brown',
    'comments': '',
    'protection': p_val
}
r = requests.post(url, data=formdata)
print(r.text)
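Forms can carry more than one hidden field. As a hedged, offline sketch, the snippet below collects every hidden input from a form using only the standard library's `html.parser`. The HTML string and the `HiddenInputCollector` class are made up for illustration; in practice you would feed it `r.text` from the GET request.

```python
from html.parser import HTMLParser

class HiddenInputCollector(HTMLParser):
    """Hypothetical helper: gathers name/value pairs of hidden inputs."""
    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'input' and attrs.get('type') == 'hidden' and 'name' in attrs:
            # A missing or empty value attribute is stored as an empty string
            self.hidden[attrs['name']] = attrs.get('value') or ''

# Sample markup standing in for the page source (an assumption)
sample_html = '''
<form method="post">
  <input type="text" name="name">
  <input type="hidden" name="protection" value="2c17abf5d5b4e326bea802600ff88405">
</form>
'''
collector = HiddenInputCollector()
collector.feed(sample_html)
# Merge the hidden fields into the data we want to submit
formdata = {'name': 'Seppe', **collector.hidden}
print(formdata)
```

Collecting all hidden fields at once avoids hard-coding the field name, which helps if a site adds or renames such tokens.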


<b>Problem</b><pre>If GET requests use URL parameters, and POST requests send data as part of the HTTP request body,
why do we need two separate arguments (params and data) when we can already indicate the type of
request by using either the requests.get or requests.post method?
The answer is that a POST request can carry both URL parameters and a request body.
For example, you might encounter this “form” tag definition in a page’s source code:
form action="submit.html?type=student" method="post"
</pre>

In [8]:
url = 'http://www.webscrapingfordatascience.com/postform2/'
paramdata = {'name': 'Totally Not Seppe'}
formdata = {'name': 'Seppe'}
r = requests.post(url, params=paramdata, data=formdata)
print(r.text)
print(r.request.url)


<html>
	<body>


<h2>Thanks for submitting your information</h2>

<p>Here's a dump of the form data that was submitted:</p>

<pre>array(1) {
  ["name"]=>
  string(5) "Seppe"
}
</pre>


	</body>
</html>

http://www.webscrapingfordatascience.com/postform2/?name=Totally+Not+Seppe
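To make the split concrete, here is a standard-library sketch (an approximation, not requests' internals) of where each argument ends up: `params` is encoded into the URL's query string, while `data` is encoded into the request body.

```python
from urllib.parse import urlencode

url = 'http://www.webscrapingfordatascience.com/postform2/'
paramdata = {'name': 'Totally Not Seppe'}
formdata = {'name': 'Seppe'}

# params -> query string appended to the URL
full_url = url + '?' + urlencode(paramdata)
# data -> form-encoded request body
body = urlencode(formdata)

print(full_url)
print(body)
```

The server's dump above only shows the body value because the example page prints the POST data, not the URL parameters.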


Handling inputs of type file
<plaintext><form action="upload.php" method="post" enctype="multipart/form-data">
<input type="file" name="profile_picture">
<input type="submit" value="Upload your profile picture">
</form>

In [2]:
import requests
url = 'http://www.webscrapingfordatascience.com/postform2/'
formdata = {'name': 'Seppe'}
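# Open the file in binary mode; the with block closes it once the upload is done
with open('Capture.jpg', 'rb') as picture:
    filedata = {'profile_picture': picture}
    r = requests.post(url, data=formdata, files=filedata)
print(r.request.headers)  # request headers
print(r.headers)  # response headers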

{'User-Agent': 'python-requests/2.28.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Content-Length': '9548', 'Content-Type': 'multipart/form-data; boundary=cadd04ed30cba9ec763afa7f1e911ddc'}
{'Date': 'Wed, 14 Dec 2022 16:55:24 GMT', 'Server': 'Apache/2.4.41 (Ubuntu)', 'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip', 'Content-Length': '181', 'Keep-Alive': 'timeout=5, max=100', 'Connection': 'Keep-Alive', 'Content-Type': 'text/html; charset=UTF-8'}


<h2>More on headers </h2>
<b>PROBLEM</b> Modifying headers 

In [3]:
import requests
url = 'http://www.webscrapingfordatascience.com/usercheck/'
r = requests.get(url)
# Shows: It seems you are using a scraper
print(r.text)
# How does the server know that we are using a scraper?
print(r.request.headers)  # Because of the User-Agent header, which requests adds automatically

It seems you are using a scraper!
{'User-Agent': 'python-requests/2.28.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}


<b>Modifying Headers</b> to blend in

In [4]:
import requests
url = 'http://www.webscrapingfordatascience.com/usercheck/'
my_headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)AppleWebKit/537.36 ' + ' (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}
r = requests.get(url, headers=my_headers)
print(r.text)
print(r.request.headers)

Welcome, normal user!
{'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)AppleWebKit/537.36  (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}


<b>Referer</b> (a misspelling of “referrer”) header
<pre>Browsers will include this header to indicate the URL of the web page that
linked to the URL being requested. Some websites will check this to prevent “deep links”
from working. To test this out, navigate to
http://www.webscrapingfordatascience.com/referercheck/ in your browser and click the
“secret page” link. You’ll be linked to another page
(http://www.webscrapingfordatascience.com/referercheck/secret.php)
containing the text “This is a totally secret page.” Now try opening this URL directly
in a new browser tab. You’ll see a message “Sorry, you seem to come from another web
page” instead. The same happens in requests:</pre>

In [7]:
import requests
url = 'http://www.webscrapingfordatascience.com/referercheck/secret.php'
r = requests.get(url)
print(r.headers)
print(r.text)
# Shows: Sorry, you seem to come from another web page

{'Date': 'Fri, 02 Dec 2022 14:28:31 GMT', 'Server': 'Apache/2.4.41 (Ubuntu)', 'Content-Length': '45', 'Keep-Alive': 'timeout=5, max=100', 'Connection': 'Keep-Alive', 'Content-Type': 'text/html; charset=UTF-8'}
Sorry, you seem to come from another web page


<pre>When encountering such checks in requests, we can simply spoof the “Referer” header as well:</pre>

In [8]:
import requests
url = 'http://www.webscrapingfordatascience.com/referercheck/secret.php'
my_headers = {
'Referer': 'http://www.webscrapingfordatascience.com/referercheck/'
}
r = requests.get(url, headers=my_headers)
print(r.text)

This is a totally secret page


<h2>Redirection </h2>
<pre>Open the page http://www.webscrapingfordatascience.com/redirect/ in your browser.
You’ll see that you’re immediately sent to another page (“destination.php”). Now do
the same again while inspecting the network requests in your browser’s developer
tools (in Chrome, you should enable the “Preserve log” option to prevent Chrome from
cleaning the log after the redirect happens). Note how two requests are being made by
your browser: the first to the original URL, which now returns a 302 status code. This
status code instructs your browser to perform a second request to the “destination.php”
URL. How does the browser know what the URL should be? By inspecting the original
URL’s response, you’ll note that there is now a “Location” response header present,
which contains the URL to be redirected to.</pre>


<b>PROBLEM</b> In the cell below, note that we get the HTTP reply corresponding with the final destination (“you’ve
been redirected here from another page!”). In most cases, this default behavior is quite
helpful: requests is smart enough to “follow” redirects on its own when it receives 3XX
status codes. But what if this is not what we want? What if we’d like to get the contents of the
original page? This isn’t shown in the browser either, but there might be relevant
response content present. What if we want to see the contents of the “Location” and
“SECRET-CODE” headers manually?

In [14]:
import requests
url = 'http://www.webscrapingfordatascience.com/redirect/'
r = requests.get(url)
print(r.status_code)
print(r.text)
print(r.headers)

200
Hello, there -- you've been redirected here from another page!

{'Date': 'Fri, 02 Dec 2022 16:08:23 GMT', 'Server': 'Apache/2.4.41 (Ubuntu)', 'Content-Length': '64', 'Keep-Alive': 'timeout=5, max=99', 'Connection': 'Keep-Alive', 'Content-Type': 'text/html; charset=UTF-8'}


<b>Preventing automatic redirects</b> Set allow_redirects to False in the request call.

In [13]:
import requests
url = 'http://www.webscrapingfordatascience.com/redirect/'
r = requests.get(url, allow_redirects=False)
print(r.text)
print(r.headers)

You will be redirected... bye bye!
{'Date': 'Fri, 02 Dec 2022 15:10:53 GMT', 'Server': 'Apache/2.4.41 (Ubuntu)', 'SECRET-CODE': '1234', 'Location': 'http://www.webscrapingfordatascience.com/redirect/destination.php', 'Content-Length': '34', 'Keep-Alive': 'timeout=5, max=100', 'Connection': 'Keep-Alive', 'Content-Type': 'text/html; charset=UTF-8'}
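When redirects are handled manually, the `Location` value has to be turned into the next URL to request. Servers may send either an absolute URL, as in the output above, or a relative one. A minimal sketch using the standard library's `urljoin`, which handles both cases (the relative value here is a hypothetical example):

```python
from urllib.parse import urljoin

original_url = 'http://www.webscrapingfordatascience.com/redirect/'

# Absolute Location header, as returned by the example site
absolute = urljoin(
    original_url,
    'http://www.webscrapingfordatascience.com/redirect/destination.php')
# Relative Location header (hypothetical), resolved against the original URL
relative = urljoin(original_url, 'destination.php')

print(absolute)
print(relative)
```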


<h2>Authentication</h2>
Finally, let’s take a closer look at the 401 (“Unauthorized”) status code, which
seems to indicate that HTTP provides some sort of authentication mechanism. Indeed,
the HTTP standard includes a number of authentication mechanisms, one of which
can be seen by accessing the URL <pre>http://www.webscrapingfordatascience.com/authentication/</pre>. You’ll note that this site requests a username and password through
your browser. If you press “Cancel,” you’ll note that the website responds with a 401
(“Unauthorized”) result. Try refreshing the page and entering any username and
password combination.<br>
HTTP basic authentication only Base64-encodes the credentials in the “Authorization” header
and does not encrypt them. requests can build this header for you through the auth argument:

In [16]:
import requests
url = 'http://www.webscrapingfordatascience.com/authentication/'
r = requests.get(url, auth=('myusername', 'mypassword'))
print(r.text)
print(r.request.headers)
print(r.headers)

Hello myusername.
You entered mypassword as your password.
{'User-Agent': 'python-requests/2.28.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Authorization': 'Basic bXl1c2VybmFtZTpteXBhc3N3b3Jk'}
{'Date': 'Fri, 02 Dec 2022 19:44:37 GMT', 'Server': 'Apache/2.4.41 (Ubuntu)', 'Content-Length': '58', 'Keep-Alive': 'timeout=5, max=100', 'Connection': 'Keep-Alive', 'Content-Type': 'text/html; charset=UTF-8'}
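The `Authorization` header in the output above is easy to reproduce and to reverse, which shows that basic authentication encodes the credentials rather than encrypting them. A short standard-library sketch:

```python
import base64

username, password = 'myusername', 'mypassword'

# Basic auth joins the credentials with ':' and Base64-encodes the result
token = base64.b64encode(f'{username}:{password}'.encode('utf-8')).decode('ascii')
header_value = 'Basic ' + token
print(header_value)

# Anyone who sees the header can decode it again
decoded = base64.b64decode(token).decode('utf-8')
print(decoded)
```

For this reason, basic authentication is only reasonably safe when the connection itself is encrypted with HTTPS.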


<h2>Dealing with Cookies </h2>

Let’s now go over some examples to learn how we can deal with cookies
in requests. The first example we’ll explore can be found at
<a>http://www.webscrapingfordatascience.com/cookielogin/</a>. You’ll see a simple login page.
After successfully logging in (you can use any username and password in this
example), you’ll be able to access a secret page over at
<a>http://www.webscrapingfordatascience.com/cookielogin/secret.php</a>.

In [6]:
import requests
url = 'http://www.webscrapingfordatascience.com/cookielogin/'
data = {
    'username': 123124,
    'password': 1231241
}
r = requests.post(url, data=data)
print(r.text)


<html>

<body>


You are logged in, you can now see the <a href="secret.php">secret page</a>.


</body>

</html>



Try closing and reopening
your browser (or just open an Incognito or Private Mode browser tab) and accessing
the secret URL directly. You’ll see that the server detects that you’re not sending the
right cookie information and blocks you from seeing the secret code. The same can be
observed when trying to access this page directly using requests:

In [7]:
import requests
resp=requests.get('http://www.webscrapingfordatascience.com/cookielogin/secret.php')
print(resp.text)

Hmm... it seems you are not logged in


Obviously, we need to set and include a cookie. To do so, we’ll use a new argument,
called cookies. Note that we could use the headers argument (which we’ve seen before)
to include a “Cookie” header, but we’ll see that cookies is a bit easier to use, as requests
will take care of formatting the header appropriately. For now, we can fall back on our browser’s
developer tools and copy the cookie from the request headers:

In [9]:
import requests
url = 'http://www.webscrapingfordatascience.com/cookielogin/secret.php'
my_cookies = {'PHPSESSID': '2s8s8pmuc6fo3jbb3nsrulqier'}
r = requests.get(url, cookies=my_cookies)
print(r.text)

# PHPSESSID: the examples are powered by the PHP scripting language, so the
# cookie that identifies a user's session is named "PHPSESSID". Other websites
# might use "session", "SESSION_ID", "session_id", or any other name. Do note,
# however, that the value representing a session should be constructed
# randomly in a hard-to-guess manner. Simply setting a cookie
# "is_logged_in=true" or "logged_in_user=Seppe" would of course be very easy
# to guess and spoof.

# PROBLEM: if we want to use this scraper later on, this particular session
# identifier might have been flushed and become invalid.

This is a secret code: 1234


<b>SOLUTION</b> We hence need to resort to a more robust system as follows: we’ll first perform a
POST request simulating a login, get out the cookie value from the HTTP response, and
use it for the rest of our “session.”

In [10]:
import requests
url = 'http://www.webscrapingfordatascience.com/cookielogin/'
# First perform a POST request
r = requests.post(url, data={'username': 'dummy', 'password': '1234'})
# Get the cookie value, either from r.headers or r.cookies
my_cookies = r.cookies
# r.cookies is a RequestsCookieJar object which can also
# be accessed like a dictionary. The following also works:
my_cookies['PHPSESSID'] = r.cookies.get('PHPSESSID')
# Now perform a GET request to the secret page using the cookies
r = requests.get(url + 'secret.php', cookies=my_cookies)
print(r.text)
# Shows: This is a secret code: 1234

This is a secret code: 1234
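For reference, this is roughly what reading the cookie from `r.headers` involves. The sketch below parses a `Set-Cookie` header with the standard library's `http.cookies` module and builds the matching `Cookie` request header. The header string is a made-up example in a typical format, not one captured from the site.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie header value from a login response
set_cookie_header = 'PHPSESSID=2s8s8pmuc6fo3jbb3nsrulqier; path=/'

jar = SimpleCookie()
jar.load(set_cookie_header)
session_id = jar['PHPSESSID'].value

# The value a client sends back in its Cookie request header
cookie_header = '; '.join(f'{name}={morsel.value}' for name, morsel in jar.items())
print(session_id)
print(cookie_header)
```

Using `r.cookies` spares you this parsing step, which is why the example above relies on it.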


<b>More complex login (and cookie) flows.</b>
The code above won't work for a login page that redirects after a successful POST, such as
http://www.webscrapingfordatascience.com/redirlogin/. The reason is related to something we’ve seen before:
requests will automatically follow HTTP redirect status codes, but the “Set-Cookie”
response header is present in the response following the HTTP POST request, and not
in the response for the redirected page. We’ll hence need to use the allow_redirects
argument once again:

In [15]:
import requests
url = 'http://www.webscrapingfordatascience.com/redirlogin/'
# First perform a POST request -- do not follow the redirect
r = requests.post(url, data={'username': 'dummy', 'password': '1234'},
allow_redirects=False)
# Get the cookie value, either from r.headers or r.cookies
print(r.cookies)
my_cookies = r.cookies
# Now perform a GET request manually to the secret page using the cookies
r = requests.get(url + 'secret.php', cookies=my_cookies)
print(r.text)

<RequestsCookieJar[<Cookie PHPSESSID=335q142ifhvats5d68q4s8c4dl for www.webscrapingfordatascience.com/>]>
This is a secret code: 1234


As a final example, navigate to <a>http://www.webscrapingfordatascience.com/trickylogin/</a>.
This site works in more or less the same way (explore it in your browser),
though note that the form tag now includes an “action” attribute. We might hence
change our code as follows:


In [18]:
import requests
url = 'http://www.webscrapingfordatascience.com/trickylogin/'
# First perform a POST request -- do not follow the redirect
# Note that the ?p=login parameter needs to be set
r = requests.post(url, params={'p': 'login'},
data={'username': 'dummy', 'password': '1234'},
allow_redirects=False)
print(r.status_code)
# Set the cookies
my_cookies = r.cookies
# Now perform a GET request manually to the secret page using the cookies
r = requests.get(url, params={'p': 'protected'}, cookies=my_cookies)
print(r.text)
# Hmm... where is our secret code?

200
You should login first before accessing the protected page!
<br><br>
<form method="post" action="index.php?p=login">
	Username: <input type="text" name="username"><br>
	Password: <input type="password" name="password"><br>
	<input type="Submit" value="Access the secret page">
</form>



This doesn’t work for this example. The reason is that this particular
example also checks whether we’ve actually visited the login page, rather than
directly submitting the login information. We can try requesting the login page first.
<b>This also does not seem to work yet.</b>

In [19]:
import requests
url = 'http://www.webscrapingfordatascience.com/trickylogin/'
# First perform a normal GET request to get the form
r = requests.get(url)
# Then perform the POST request -- do not follow the redirect
r = requests.post(url, params={'p': 'login'},
data={'username': 'dummy', 'password': '1234'},
allow_redirects=False)
# Set the cookies
my_cookies = r.cookies
# Now perform a GET request manually to the secret page using the cookies
r = requests.get(url, params={'p': 'protected'}, cookies=my_cookies)
print(r.text)

You should login first before accessing the protected page!
<br><br>
<form method="post" action="index.php?p=login">
	Username: <input type="text" name="username"><br>
	Password: <input type="password" name="password"><br>
	<input type="Submit" value="Access the secret page">
</form>



The way the server “remembers” that we’ve seen the login screen is, of course, by setting
a cookie, so we need to retrieve that cookie after the first GET request to get the session
identifier at that moment. <b>Again, this fails. The reason (you can verify this in your browser as well) is
that this site changes the session identifier after logging in as an extra security measure.</b>

In [20]:
import requests
url = 'http://www.webscrapingfordatascience.com/trickylogin/'
# First perform a normal GET request to get the form
r = requests.get(url)
# Set the cookies already at this point!
my_cookies = r.cookies
# Then perform the POST request -- do not follow the redirect
# We already need to use our fetched cookies for this request!
r = requests.post(url, params={'p': 'login'},
data={'username': 'dummy', 'password': '1234'},
allow_redirects=False,
cookies=my_cookies)
# Now perform a GET request manually to the secret page using the cookies
r = requests.get(url, params={'p': 'protected'}, cookies=my_cookies)
print(r.text)
# Still no secret?

You should login first before accessing the protected page!
<br><br>
<form method="post" action="index.php?p=login">
	Username: <input type="text" name="username"><br>
	Password: <input type="password" name="password"><br>
	<input type="Submit" value="Access the secret page">
</form>



In [21]:
import requests
url = 'http://www.webscrapingfordatascience.com/trickylogin/'
# First perform a normal GET request to get the form
r = requests.get(url)
# Set the cookies
my_cookies = r.cookies
print(my_cookies)
# Then perform the POST request -- do not follow the redirect
# Use the cookies we got before
r = requests.post(url, params={'p': 'login'},
data={'username': 'dummy', 'password': '1234'},
allow_redirects=False,
cookies=my_cookies)
# We need to update our cookies again
# Note that the PHPSESSID value will have changed
my_cookies = r.cookies
print(my_cookies)
# Now perform a GET request manually to the secret page
# using the updated cookies
r = requests.get(url, params={'p': 'protected'}, cookies=my_cookies)
print(r.text)
# Shows: Here is your secret code: 3838.

<RequestsCookieJar[<Cookie PHPSESSID=ovja5vqlqqt60v5otrqpppmibj for www.webscrapingfordatascience.com/>]>
<RequestsCookieJar[<Cookie PHPSESSID=jodc4qon1qascoosmc41uaa6me for www.webscrapingfordatascience.com/>]>
Here is your secret code: 3838.


<h2>Dealing with Sessions </h2>

In the example below, we create a requests.Session
object and use it to perform HTTP requests, using the same methods (get, post) as
above. The example now works, without us having to worry about redirects or dealing
with cookies manually. <b>SESSIONS:</b> A session
specifies that various requests belong together (to the same session) and that
requests should hence deal with cookies automatically behind the scenes.

In [4]:
import requests
url = 'http://www.webscrapingfordatascience.com/trickylogin/'
my_session = requests.Session()  # create a session that keeps cookies across requests
r = my_session.get(url)
r = my_session.post(url, params={'p': 'login'},
data={'username': 'dummy', 'password': '1234'})
r = my_session.get(url, params={'p': 'protected'})
print(r.text)
# Shows: Here is your secret code: 3838.

Here is your secret code: 3838.
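Conceptually, a session keeps one cookie store and refreshes it after every response. The toy class below imitates that bookkeeping with the standard library only. `ToySession` and its methods are hypothetical names, the `Set-Cookie` strings reuse the session identifiers printed earlier in an assumed header format, and this is a simplified model rather than how requests.Session is actually implemented.

```python
from http.cookies import SimpleCookie

class ToySession:
    """Hypothetical, simplified model of a session's cookie handling."""
    def __init__(self):
        self.cookies = {}

    def absorb(self, set_cookie_header):
        # Store or overwrite cookies announced by a response
        parsed = SimpleCookie()
        parsed.load(set_cookie_header)
        for name, morsel in parsed.items():
            self.cookies[name] = morsel.value

    def cookie_header(self):
        # Value to send in the Cookie header of the next request
        return '; '.join(f'{k}={v}' for k, v in self.cookies.items())

session = ToySession()
session.absorb('PHPSESSID=ovja5vqlqqt60v5otrqpppmibj; path=/')  # after the first request
session.absorb('PHPSESSID=jodc4qon1qascoosmc41uaa6me; path=/')  # after logging in
print(session.cookie_header())
```

Because the store is overwritten on each response, the changed session identifier after login is picked up without any manual step.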


Note that sessions also offer an additional benefit apart from
dealing with cookies: if you need to set global header fields, such as the “User-Agent”
header, this can simply be done once instead of using the headers argument every time
to make a request:

In [5]:
import requests
url = 'http://www.webscrapingfordatascience.com/trickylogin/'
my_session = requests.Session()
my_session.headers.update({'User-Agent': 'Chrome!'})  # set a session-wide header
# All requests in this session will now use this User-Agent header:
r = my_session.post(url)
print(r.request.headers)
r = my_session.post(url, params={'p': 'login'},
data={'username': 'dummy', 'password': '1234'})
print(r.request.headers)
r = my_session.get(url, params={'p': 'protected'})
print(r.request.headers)

{'User-Agent': 'Chrome!', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Content-Length': '0'}
{'User-Agent': 'Chrome!', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Cookie': 'PHPSESSID=s21tc01rr2679r8lhm8kfgqv8c'}
{'User-Agent': 'Chrome!', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Cookie': 'PHPSESSID=93qrfo3v9pnf998o98i0qfsfvg'}


<h2>Binary, JSON, and Other Forms of Content</h2>
So far, we’ve only used requests to fetch simple
textual or HTML-based content, though remember that to render a web page, your web
browser will typically fire off a lot of HTTP requests, including requests to fetch images.
Additionally, files, like a PDF file, say, are also downloaded using HTTP requests.
To explore how this works in requests, we’ll be using an image containing a lovely
picture of a kitten at <a>http://www.webscrapingfordatascience.com/files/kitten.jpg</a>.
You might be inclined to just use the following approach:

<h3>Binary Format (Image)</h3>

In [None]:
import requests
url = 'http://www.webscrapingfordatascience.com/files/kitten.jpg'
r = requests.get(url)
print(r.text)  # Prints garbled characters: we're downloading binary data now, which cannot be represented as Unicode text
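To see why text is the wrong tool for binary data, the offline sketch below tries to decode the first bytes of a typical JPEG file as UTF-8. The four bytes are a common JPEG file signature used here as a stand-in for the real download. Strict decoding fails, and lenient decoding (similar in spirit to how r.text substitutes undecodable bytes) loses information.

```python
# First bytes of a typical JPEG file, standing in for the kitten image
jpeg_header = bytes([0xFF, 0xD8, 0xFF, 0xE0])

try:
    jpeg_header.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# Lenient decoding swaps invalid bytes for the U+FFFD replacement character
lossy = jpeg_header.decode('utf-8', errors='replace')
round_trip = lossy.encode('utf-8')

print(decoded_ok)
print(round_trip == jpeg_header)
```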

Instead of using the text attribute, we need to use content, which returns the contents
of the HTTP response body as a Python bytes object, which you can then save to a file. <b>Don’t print:</b> it’s not a good idea to print out the r.content attribute, as the large amount of output may easily crash your Python console window.

In [9]:
import requests
url = 'http://www.webscrapingfordatascience.com/files/kitten.jpg'
r = requests.get(url)
with open('image.jpg', 'wb') as my_file:
    my_file.write(r.content)

However, note that when using the content attribute, Python will store the full file contents
in memory before writing them to your file. When dealing with huge files, this can easily
overwhelm your computer’s memory capacity. To tackle this, requests also allows you to
stream in a response by setting the stream argument to True:

In [None]:
import requests
url = 'http://www.webscrapingfordatascience.com/files/kitten.jpg'
r = requests.get(url, stream=True)
# You can now use r.raw
# r.iter_lines
# and r.iter_content

Once you’ve indicated that you want to stream back a response, you can work with
the following attributes and methods:
<ul>
<li>r.raw provides a file-like object representation of the response. This
is not often used directly and is included for advanced purposes.</li>
<li>The iter_lines method allows you to iterate over a content body line
by line. This is handy for large textual responses.</li>
<li>The iter_content method does the same for binary data.</li>
</ul>
Let’s use iter_content to complete our example above:


In [10]:
import requests
url = 'http://www.webscrapingfordatascience.com/files/kitten.jpg'
r = requests.get(url, stream=True)
with open('image.jpg', 'wb') as my_file:
    # Read in 4 KB chunks and write each chunk to the file as it arrives
    for byte_chunk in r.iter_content(chunk_size=4096):
        my_file.write(byte_chunk)
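The chunked loop can be tried without a network connection. In this sketch an in-memory `io.BytesIO` stands in for the streamed response, and `copy_in_chunks` is a hypothetical helper that mirrors the loop above; only one chunk is held by the loop at a time.

```python
import io

def copy_in_chunks(source, destination, chunk_size=4096):
    """Copy source to destination chunk by chunk; return (chunks, bytes)."""
    chunks = 0
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # an empty read means the source is exhausted
            break
        destination.write(chunk)
        chunks += 1
        total += len(chunk)
    return chunks, total

fake_response = io.BytesIO(b'\x00' * 10000)  # stand-in for a 10,000-byte download
output = io.BytesIO()
chunks, total = copy_in_chunks(fake_response, output)
print(chunks, total)
```

With 10,000 bytes and a 4,096-byte chunk size, the data arrives in two full chunks and one partial chunk.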

<h3>JSON format</h3>
JSON (JavaScript Object Notation) is a lightweight textual data interchange format that
is both relatively easy for humans to read and write and easy for machines to parse
and generate. It is based on a subset of the JavaScript programming language, but its
usage has become so widespread that virtually every programming language is able to
read and generate it. You’ll see this format used a lot by various web APIs these days to
provide content messages in a structured way. To explore an example, head over
to <a>http://www.webscrapingfordatascience.com/jsonajax/</a>.

This page shows a simple
lotto number generator. Open your browser’s developer tools, and try pressing the “Get lotto numbers” button a few times. By exploring the source code of the page, you’ll
notice a few things going on:
<ul>
<li>Even though there’s a button on this page, it is not wrapped by a
form tag.</li>
<li>When pressing the button, part of the page is updated without
completely reloading the page</li>
<li>The “Network” tab in Chrome will show that HTTP POST requests are
being made when pressing the button</li>
<li>You’ll notice a piece of code in the page source wrapped inside
script tags</li>
</ul>
This page uses JavaScript inside the script tags to perform so-called AJAX
requests. AJAX stands for Asynchronous JavaScript And XML. Let’s look at the
HTTP requests it is making to see how it works:
<ul>
<li>POST requests are being made to “results.php”</li>
<li>The “Content-Type” header is set to “application/x-www-form-urlencoded,” just like before. The client-side JavaScript will make sure
to reformat a JSON string to an encoded equivalent.</li>
<li>An “api_code” is submitted in the POST request body</li>
<li>The HTTP response has a “Content-Type” header set to
“application/json,” instructing the client to interpret the result as
JSON data</li>
</ul>
We can just use text
as before and, for example, convert the returned result to a Python structure manually
(Python provides a json module to do so), but requests also provides a helpful json
method to do this in one go:

In [13]:
import requests
url = 'http://www.webscrapingfordatascience.com/jsonajax/results.php'
r = requests.post(url, data={'api_code': 'C123456'})
print(r.request.headers)
print(r.json())
print(r.json().get('results'))

{'User-Agent': 'python-requests/2.28.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Content-Length': '16', 'Content-Type': 'application/x-www-form-urlencoded'}
{'results': [1, 12, 13, 15, 24, 27, 29]}
[1, 12, 13, 15, 24, 27, 29]
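The manual route mentioned above, using Python's json module on the response text, looks roughly like this. The `response_text` string is copied from the output above and stands in for `r.text`.

```python
import json

# Stand-in for r.text, taken from the output above
response_text = '{"results": [1, 12, 13, 15, 24, 27, 29]}'

parsed = json.loads(response_text)  # JSON text -> Python dict
numbers = parsed.get('results')
print(numbers)
```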


There’s one important remark here, however. Some APIs and sites will also use an
“application/json” “Content-Type” for formatting the request and hence submit the
POST data as plain JSON, for instance <a>http://www.webscrapingfordatascience.com/jsonajax/results2.php</a>:

In [12]:
import requests
url = 'http://www.webscrapingfordatascience.com/jsonajax/results2.php'
# Use the json argument to encode the data as JSON:
r = requests.post(url, json={'api_code': 'C123456'})
# Note the Content-Type header in the request:
print(r.request.headers)
print(r.json())

{'User-Agent': 'python-requests/2.28.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Content-Length': '23', 'Content-Type': 'application/json'}
{'results': [1, 3, 4, 14, 16, 22, 29]}
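The difference between the data and json arguments comes down to how the request body is serialized. Below is a standard-library sketch of both encodings, as an approximation of what requests sends. The lengths match the `Content-Length` values of 16 and 23 in the two request headers shown in this section.

```python
import json
from urllib.parse import urlencode

payload = {'api_code': 'C123456'}

form_body = urlencode(payload)   # roughly what data=payload produces
json_body = json.dumps(payload)  # roughly what json=payload produces

print(form_body, len(form_body))
print(json_body, len(json_body))
```

Sending the wrong encoding is a common reason for an API to reject a request, so it is worth checking the “Content-Type” the browser uses in the developer tools first.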
